Neurocomputing 175 (2016) 473–491
A density-based noisy graph partitioning algorithm

Jaehong Yu, Seoung Bum Kim
Department of Industrial Management Engineering, Korea University, Anam-dong, Seoungbuk-Gu, Seoul 136-713, Republic of Korea
Article history: Received 8 March 2015; Received in revised form 13 October 2015; Accepted 24 October 2015; Available online 9 November 2015. Communicated by Hung-Yuan Chung.

Abstract
Clustering analysis can facilitate the extraction of implicit patterns in a dataset and elicit its natural groupings without requiring prior classification information. Numerous researchers have focused recently on graph-based clustering algorithms because their graph structure is useful in modeling the local relationships among observations. These algorithms perform reasonably well in their intended applications. However, no consensus exists about which of them best satisfies all the conditions encountered in a variety of real situations. In this study, we propose a graph-based clustering algorithm based on a novel density-of-graph structure. In the proposed algorithm, a density coefficient defined for each node is used to classify dense and sparse nodes. The main structures of clusters are identified through dense nodes and sparse nodes that are assigned to specific clusters. Experiments on various simulation datasets and benchmark datasets were conducted to examine the properties of the proposed algorithm and to compare its performance with that of existing spectral clustering and modularity-based algorithms. The experimental results demonstrated that the proposed clustering algorithm performed better than its competitors; this was especially true when the cluster structures in the data were inherently noisy and nonlinearly distributed. © 2015 Elsevier B.V. All rights reserved.
Keywords: Clustering algorithm; Nonlinearity; Density coefficient; Maximizing connectivity
1. Introduction

Modern industrial processes generate an unprecedented wealth of data that overwhelms traditional analytical approaches. Clustering analysis can facilitate the extraction of implicit patterns from these huge datasets and thus elicit their natural groupings. Clustering algorithms systematically partition the dataset by minimizing within-group variation and maximizing between-group variation [1]. Clustering analysis has been applied in various fields, such as text mining [2], image segmentation [3], bioinformatics [4], Web mining [5], and manufacturing [6]. Numerous clustering algorithms have been developed [7]. The most prominent of these are k-means [8], density-based spatial clustering of applications with noise (DBSCAN; [9]), and modularity-based clustering [10,11]. Although most of the existing algorithms perform reasonably well within the situations for which they were designed, no consensus exists about which is the best all-around performer in real-life situations. Most existing clustering algorithms perform poorly when the cluster structures inherent in the dataset have nonlinear patterns and different densities [12].
To address these limitations, the technique of transforming from a feature space to a graph space has been adapted to the design of clustering algorithms. By expressing the data as a graph structure, the local relationships between observations can be effectively modeled [13]. In a graph, nodes and edges express the observations and their relationships. In other words, graphs, by their topological nature, are more naturally suited to expressing certain dataset relationships and structures [13,14]. Because of these advantages, graph techniques have been widely applied in various machine learning areas, such as manifold learning [15], semi-supervised learning [16], and clustering [17–19]. A graph-based clustering algorithm discovers the intrinsic groupings of a dataset by extracting topological information of relative adjacency among observations [20]. A number of graph-based clustering algorithms have been proposed to capitalize on the benefits described above [21]. In graph-based clustering, a subgraph can be considered as a cluster, and the goal is to maximize the intraconnectivity within subgraphs [22,23]. Various objective functions have been proposed to properly discover the clusters in a graph. These include cut [24], ratio cut [25], normalized cut [18,26], and conductance [27]. However, the optimization problems raised by these objective functions are hard to solve because they are nondeterministic polynomial-time hard (NP-hard) problems [28]. To deal with this computational issue and solve the problem more efficiently, spectral clustering methods have been proposed that ease the
optimization difficulties by adopting a spectral decomposition technique [29]. Moreover, several variants of these spectral clustering methods have been developed [30]. Among these, an algorithm proposed by Ng et al. [26] is notable. In Ng's algorithm, the normalized adjacency matrix is computed from the graph structure, and the resulting matrix is partitioned by using spectral decomposition and k-means clustering methods. This algorithm has been widely used because of its simplicity of implementation and outstanding performance in many situations [31]. However, despite its success, this algorithm has several limitations. First, the number of clusters must be determined in advance. This requirement may cause problems, especially when explicit knowledge of the data is not readily available [32]. Furthermore, spectral clustering algorithms do not work well on datasets that contain the noisy clusters common to many real situations [30,33]. In their analyses of graph clustering, many researchers in recent years have focused on modularity-based algorithms [10,34]. Modularity measures the significance of the connections between the nodes within a cluster. High modularity implies that clusters are properly constructed. Modularity is known as an effective measure for examining the adequacy of intrinsic clusters in a graph [35]. However, maximizing modularity often produces unsatisfactory results because it may lead to inadequate partitioning of nonlinear patterns, such as S-curves and Swiss roll shapes [36]. Chameleon [37] and the Markov cluster algorithm (MCL; [38]) are also well-known graph-based clustering algorithms. The Chameleon algorithm starts with the k-nearest neighbor graph and partitions the graph structure into a large number of small initial clusters. The initial clusters are then merged to maximize the internal self-similarity of the clusters. The Chameleon algorithm works well when the clusters have nonlinear patterns and different densities [39]. However, this algorithm suffers from the curse of dimensionality in high-dimensional data and requires a number of user-defined parameters [12]. MCL partitions the dataset based on a stochastic process. This algorithm constructs a transition matrix from the adjacencies between observations and expands it until the matrix converges. The final clusters are identified from the converged transition matrix [38]. MCL is widely used for graph partitioning because of its effectiveness and robustness against noise. In addition, this algorithm does not require the number of clusters in advance. In spite of these advantages, MCL might not be suitable for identifying nonlinear clusters because it tends to partition large clusters [40]. In the present study, we propose a novel graph-based clustering algorithm that is especially useful for grouping data exhibiting noisy and nonlinear patterns. To achieve robustness against background noise, the proposed algorithm differentiates between dense and sparse nodes [23]. The proposed algorithm determines the main structure of each cluster in the dense regions of a graph; the clusters are then separated by the sparse regions of the graph. The basic concept of the proposed algorithm derives from the density-level set approach [23,41,42]. Two types of noise treatment schemes, rough cluster identification and exact cluster identification, can be defined in a density-level set approach [23].
In choosing between these two noise treatment schemes, we focus on exact cluster identification because of its robustness against noisy observations. The remainder of this paper is organized as follows. Section 2 introduces the proposed clustering algorithm. Section 3 presents a simulation study to demonstrate the advantages of the proposed algorithm over existing algorithms. Section 4 reports the results of experiments on real benchmark datasets that compare the proposed algorithm with existing graph-based clustering algorithms. Section 5 analyzes the computational complexity of the proposed algorithm, and Section 6 contains our concluding remarks.
2. Proposed algorithm

The proposed density-based noisy graph partitioning (DENGP) algorithm consists of five main steps. The first is to represent the data as a mutual k-nearest neighbor graph. In this graph, all observations are represented as nodes. Second, the density of each node (called the density coefficient) is computed to determine the dense regions in the graph structure. Having calculated the density coefficients of all nodes, the nodes are classified either as core nodes or surrounding nodes. Those classified as surrounding nodes are temporarily excluded from the clustering procedure. Third, the core nodes are partitioned into several initial subgroups, and these subgroups are agglomerated to maximize the intraconnectivity within each cluster. In other words, clusters that are connected with each other are hierarchically merged until no connection between them exists. In the fourth step, the temporarily excluded surrounding nodes are assigned to one of the clusters by a weighted majority voting scheme. Finally, all nodes are examined to see whether they have been properly assigned. If a node has been assigned incorrectly, it is reassigned to the maximally connected cluster. Fig. 1 illustrates the overall process of the proposed DENGP algorithm with an illustrative dataset containing three clusters. As shown in Fig. 1a, the original dataset is first transformed into a mutual k-nearest neighbor graph structure. The density coefficients of all nodes are then computed, and each node is classified as either a core or a surrounding node based on a given threshold value. Section 2.2 describes the detailed process to determine the appropriate threshold value. After the classification, surrounding nodes are temporarily removed from the graph. In Fig. 1b, the core and surrounding nodes are expressed as pentagrams (five-pointed stars) and diamonds, respectively. As shown in this figure, the clusters are more clearly delineated after the surrounding nodes have been eliminated. The third step groups the core nodes into several cluster structures, as shown in Fig. 1c. This figure illustrates the construction of three clusters from the core nodes, which are represented as circles, triangles, and squares. In the next step, the surrounding nodes temporarily removed earlier are assigned to the appropriate cluster, as shown in Fig. 1d. Finally, incorrectly assigned nodes are reassigned to the appropriate cluster labels. In Fig. 1e, the nodes included in the dashed regions are incorrectly assigned nodes. A more detailed explanation of each step of the proposed algorithm is presented in the following sections.

2.1. Constructing the mutual k-nearest neighbor graph

The first step of the proposed algorithm is to represent the data as a graph structure. As mentioned in Section 1, the cluster analysis of nonlinear patterns makes frequent use of representing a dataset as a neighborhood graph structure [43,44]. Several types of neighborhood graph structures exist. These include the ε-nearest neighbor graph, the symmetric k-nearest neighbor graph, and the mutual k-nearest neighbor graph [45]. Of these, the mutual k-nearest neighbor graph is sparser than other graph schemes, a feature that leads to the minimization of noise effects [16]. This makes the boundaries between clusters clearer [12]. Hence, in this study, we use the mutual k-nearest neighbor graph to group the data.
The definition of the mutual k-nearest neighbor graph is as follows:

Definition 1 (Mutual k-nearest neighbor graph). A mutual k-nearest neighbor graph with n nodes is constructed as follows. An edge e_{ij} between nodes i and j is defined as

e_{ij} = \begin{cases} 1, & \text{if } x_i \in K(j) \text{ and } x_j \in K(i) \\ 0, & \text{otherwise.} \end{cases}    (1)
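For concreteness, the following is a minimal Python sketch of how the adjacency of Eq. (1) can be built from a raw data matrix. It assumes NumPy and a Euclidean metric; the function name build_mutual_knn_graph is ours and is not part of the original paper.

import numpy as np

def build_mutual_knn_graph(X, k):
    """Return the n x n 0/1 adjacency matrix of the mutual k-nearest neighbor graph (Eq. (1))."""
    n = X.shape[0]
    # pairwise Euclidean distances
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(D, np.inf)            # a node is not its own neighbor
    knn = np.argsort(D, axis=1)[:, :k]     # K(i): indices of the k nearest neighbors of node i
    member = np.zeros((n, n), dtype=bool)
    rows = np.repeat(np.arange(n), k)
    member[rows, knn.ravel()] = True       # member[i, j] is True iff x_j belongs to K(i)
    E = (member & member.T).astype(int)    # e_ij = 1 iff x_j in K(i) and x_i in K(j)
    return E

The neighborhood N(i) of Eq. (2) is then simply the set of indices j for which E[i, j] equals 1.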
Fig. 1. Graphical illustration for the proposed clustering algorithm with example dataset: (a) Constructing mutual k-nearest neighbor graph from the dataset, (b) Classifying the nodes using density coefficient, (c) Partitioning the core nodes into three subgroups, (d) Assigning the surrounding nodes, and (e) Reassigning incorrectly assigned nodes.
Eq. (1) denotes the definition of the mutual k-nearest neighbor graph. Here, K(i) is the k-nearest neighborhood set of an observation x_i. According to Eq. (1), an edge is created if and only if the observation x_i belongs to K(j) and the observation x_j belongs to K(i) at the same time. In the mutual k-nearest neighbor graph, two nodes are neighbors when they are linked by an edge. The following equation denotes the neighborhood of observation x_i, N(i):

N(i) = \{ j \mid e_{ij} = 1 \}.    (2)

The following step uses the neighborhood to characterize the density of each node.

2.2. Characterization of nodes

The second step of the proposed algorithm calculates the densities of the nodes to identify dense and sparse regions. This calculation first requires quantification of the relationship between nodes by a weighting of their edges. Although a Gaussian kernel function is a popular measure to represent the weight between two nodes [26], it does not work well when the local patterns of clusters are different. To overcome this limitation, Zelnik-Manor and Perona [46] proposed a locally scaled similarity measure that is defined by the following equation:

w_{ij} = \exp\left( -\frac{d(x_i, x_j)^2}{d(x_i, x_i^k)\, d(x_j, x_j^k)} \right),    (3)

where d(x_i, x_j) is the distance between observations x_i and x_j, and x_i^k is the kth nearest neighbor of observation x_i. Any distance metric can be used as the distance between nodes. The similarity between two observations is scaled according to their neighboring structures. By doing so, this measure adequately reflects the local patterns of the clusters [47]. However, this measure must be modified appropriately to describe the relationships between nodes in the mutual k-nearest neighbor graph. The modified weight between nodes i and j is defined as follows:

Definition 2 (Locally scaled similarity for the mutual k-nearest neighbor graph).

w_{ij} = \begin{cases} \exp\left( -\dfrac{d(x_i, x_j)^2}{d_i^{N(i)}\, d_j^{N(j)}} \right), & \text{if } e_{ij} = 1 \\ 0, & \text{otherwise,} \end{cases}    (4)
where d_i^{N(i)} = \max\{ d(x_i, x_j) \mid x_j \in N(i) \}. d_i^{N(i)} is the maximum distance between an observation x_i and its neighbors within the mutual k-nearest neighbor graph. The weights are close to one if the nodes are similar to each other and zero otherwise. By bounding the values within the mutual neighboring relationship, this weight function can be used to describe the mutual k-nearest neighbor graph. In this step, the densities of the nodes are computed based on the neighborhood and the modified weight function. In graphical approaches, various centrality measures are available, such as degree [48], closeness centrality, betweenness centrality [49], and Bonacich power centrality [50]. These centrality measures can be regarded as the densities of nodes. In this study, we propose a novel centrality measure of a graph to accommodate clustering problems. We call the proposed measure the density coefficient.

Definition 3 (Density coefficient). The density coefficient of node i, d_i, is given by

d_i = \sum_{j \in N(i)} w_{ij} + \sum_{j,k \in N(i),\, j < k} w_{jk}.    (5)
The first term of Eq. (5) is the degree of node i, and the second term represents the sum of weights between the neighboring nodes, which takes into consideration the connections within the neighborhood. The proposed density measure considers not only the degree of connectivity of a node to its neighbors, but also the mutual connectivity of its neighbors to themselves. Hence, a node that has a large density coefficient will be located in a locally densely populated region of the graph. This implies that nodes having a larger density coefficient value are closer to the center of each cluster than others. We call these "core nodes." Nodes with smaller coefficients are likely located in sparser regions or in the boundary regions between clusters. We call these "surrounding nodes." After calculating the density coefficients of all nodes, those with a low density coefficient value are classified as surrounding nodes based on a threshold value and temporarily removed from the graph. The threshold value is determined as the 100α-th percentile value of the density coefficients, where the parameter α is predefined; observations whose density coefficients fall below this threshold are removed. This threshold can be determined from the probability density function of the data. Fig. 2 displays an example of the distribution of the density coefficient. As shown in Fig. 2, the density coefficient may not follow a known probability distribution. Thus, we propose to use a bootstrap method to estimate the threshold. The bootstrap method is a widely used resampling technique that does not require any distributional assumptions about the data for statistical inference [51]. To apply the bootstrap method, the number of resampling iterations should be predefined. In general, more than 500 resamples do not significantly improve the results of the bootstrap method [51]. For this reason, we set the resampling number to 500. The threshold is determined as the arithmetic mean of the 100α-th percentile values of the 500 bootstrap samples. After the classification process, the core and surrounding node sets are denoted as Λ and Φ, respectively.
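As an illustration of Eqs. (3)-(5) and the bootstrap threshold, the sketch below (NumPy-based; the function names are ours, and the adjacency matrix E is assumed to come from a mutual k-nearest neighbor construction such as the one sketched earlier) computes the locally scaled weights, the density coefficients, and a 100α-th percentile threshold.

import numpy as np

def locally_scaled_weights(X, E):
    """Eq. (4): locally scaled similarity restricted to mutual k-NN edges."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    # d_i^{N(i)}: largest distance from x_i to any of its mutual neighbors
    scale = np.where(E.any(axis=1), (D * E).max(axis=1), 1.0)
    W = np.exp(-D ** 2 / np.outer(scale, scale)) * E
    return W

def density_coefficients(W, E):
    """Eq. (5): degree of node i plus the weights among its own neighbors."""
    n = W.shape[0]
    d = W.sum(axis=1)                          # first term: sum over j in N(i) of w_ij
    for i in range(n):
        nbr = np.flatnonzero(E[i])
        d[i] += W[np.ix_(nbr, nbr)].sum() / 2  # second term: sum over j < k in N(i) of w_jk
    return d

def percentile_threshold(d, alpha, n_boot=500, rng=None):
    """Bootstrap estimate of the 100*alpha percentile of the density coefficients."""
    rng = np.random.default_rng(rng)
    samples = rng.choice(d, size=(n_boot, d.size), replace=True)
    return np.percentile(samples, 100 * alpha, axis=1).mean()

Under this sketch, nodes whose density coefficients fall below the returned threshold would be labeled surrounding nodes (Φ) and the remainder core nodes (Λ).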
Fig. 2. A histogram of the density coefficients.

2.3. Clustering core nodes based on a maximizing intraconnectivity algorithm

Having distinguished between surrounding and core nodes, the core nodes are clustered to maximize overall connectivity. This maximization process is mainly divided into initialization and merging steps. In the initialization step, several clusters are constructed from the core nodes. First, the node with the largest density coefficient is selected, and its neighbors are formed into a cluster. The density coefficient of the selected node can be considered as the density of the cluster.
Fig. 3. Graphical illustration of a maximizing intraconnectivity algorithm (a) Construction of 15 clusters through an initialization step, (b) Agglomeration process of these clusters in the merging step, (c) The result summarized as a hierarchical structure.
Eq. (6) denotes the density of cluster C_m, D(C_m):

D(C_m) = \sum_{j,k \in C_m} w_{jk} = \sum_{j \in N(i)} w_{ij} + \sum_{j,k \in N(i),\, j < k} w_{jk} = d_i,    (6)

where C_m = \{x_i \mid x_i \in \Lambda\} \cup \{N(i) \mid N(i) \in \Lambda\}. This procedure is repeated until all core nodes are contained within defined clusters. Next, the connectivity between these clusters is calculated. The connectivity between clusters C_m and C_n, E(C_m, C_n), can be computed as follows:

E(C_m, C_n) = \sum_{i \in C_m,\, j \in C_n} w_{ij}.    (7)

Eq. (7) expresses the connectivity between clusters as the sum of all edge weights across the clusters. In the merging step, the initial clusters are hierarchically agglomerated into larger clusters to maximize the intraconnectivity within each cluster. More details of this step are as follows. At first, the densest cluster is selected, so that this cluster maximizes the internal connectivity. Next, the set of clusters connected to the selected cluster should be found, where the connected cluster set of C_m, \Gamma(C_m), is defined as follows:

\Gamma(C_m) = \{ C_n \in S \mid E(C_m, C_n) > 0 \},    (8)
Fig. 4. Assignment to a cluster using a weighted majority voting scheme.
where S is the set of clusters. It is obvious that all clusters connected with each other should be merged so as to maximize the internal connectivity of the agglomerated cluster. The clusters are merged, and each connected cluster is updated as a merged cluster. After merging, the densities of the merged clusters are recalculated. The density of the merged cluster C'_m, D(C'_m), can be computed as follows:

D(C'_m) = D(C_m) + \sum_{C_n \in \Gamma(C_m)} D(C_n) + \sum_{C_n \in \Gamma(C_m)} E(C_m, C_n)
        = \sum_{i,j \in C_m} w_{ij} + \sum_{C_n \in \Gamma(C_m)} \sum_{i,j \in C_n} w_{ij} + \sum_{C_n \in \Gamma(C_m)} \sum_{i \in C_m,\, j \in C_n} w_{ij}
        = \sum_{i,j \in C'_m} w_{ij}.    (9)
Eq. (9) implies that the density of the merged cluster is the overall sum of weights within the cluster. The merged clusters are deleted from S, and this procedure is repeated until all clusters in S have been processed. This merging procedure is repeated until no connectivity between clusters is found. Fig. 3 illustrates the overall process of the maximizing intraconnectivity algorithm with a graphical example. In Fig. 3a, the initialization step constructs 15 clusters. After composition of these initial clusters, all connected clusters are merged to maximize the intraconnectivity of the clusters. These clusters are agglomerated until there are no further connections between clusters. Fig. 3b shows the grouping process; the result of this process is summarized in Fig. 3c as a hierarchical structure. In this figure, the 15 initial clusters are grouped into two large clusters. The detailed process of the maximizing intraconnectivity algorithm is presented in pseudo code form in Appendix A, which also describes all functions and elements used in the pseudo code.
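The merging step can also be sketched in a few lines of Python. The fragment below is a simplified illustration of Eqs. (6)-(9), not the authors' implementation; initial clusters are assumed to be given as lists of node indices, and W is the weight matrix restricted to the core nodes.

import numpy as np

def merge_clusters(clusters, W):
    """Greedily merge connected clusters until no inter-cluster connectivity remains (Eqs. (7)-(9))."""
    def connectivity(a, b):                   # Eq. (7): E(C_m, C_n)
        return W[np.ix_(a, b)].sum()

    def density(c):                           # Eqs. (6)/(9): sum of weights within a cluster
        return W[np.ix_(c, c)].sum() / 2

    merged = True
    while merged:
        merged = False
        # visit clusters from densest to sparsest and absorb every cluster connected to the current one
        order = sorted(range(len(clusters)), key=lambda m: density(clusters[m]), reverse=True)
        for m in order:
            linked = [n for n in range(len(clusters))
                      if n != m and connectivity(clusters[m], clusters[n]) > 0]   # Eq. (8)
            if linked:
                for n in linked:
                    clusters[m] = clusters[m] + clusters[n]
                clusters = [c for idx, c in enumerate(clusters) if idx not in linked]
                merged = True
                break                         # recompute densities after each merge
    return clusters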
Fig. 5. A graphical example for a reassignment procedure: (a) Surrounding nodes 1 and 2 are connected to Cluster 1 and surrounding node 3 is connected to Cluster 2, (b) Assigning the surrounding nodes based on the weighted majority voting scheme, (c) Node 3 is detected as an incorrectly assigned node based on all weights connected to node 3, (d) Reassigning the incorrectly assigned node to the correct cluster label.
2.4. Assignment of surrounding nodes

In this next step, we assign the surrounding nodes to groups that have already been identified. If surrounding nodes belong to a neighborhood set of previously clustered nodes, they are assigned to those clusters. We call these nodes neighbor surrounding nodes. Each neighbor surrounding node is assigned a cluster label via a weighted majority voting scheme. The weighted majority voting scheme is expressed in Eq. (10):

y_i = \arg\max_{C_m} \sum_{j \in C_m} w_{ij}.    (10)
In the above equation, a neighbor surrounding node i is allocated to the cluster to which it is maximally connected. Fig. 4 illustrates the weighted majority voting scheme, in which the dashed circle represents the surrounding node in question. The sum of weights that connect this surrounding node and Cluster 1 is 1.6. Clusters 2 and 3 are connected to this surrounding node with weighted sums of 1.2 and 0.6, respectively. In this case, the surrounding node is assigned to Cluster 1, to which it is the most cohesively connected. Each cluster is repeatedly updated by assigning the surrounding nodes until no more neighbor surrounding nodes exist. The nodes not assigned to any cluster are reported as outliers.
Fig. 6. Proposed clustering algorithm with different parameters: (a) k = 40 and α = 0.3, (b) k = 40 and α = 0.7, (c) k = 160 and α = 0.3, (d) k = 160 and α = 0.7.
Density-based silhouette index calculation procedure
Step 1. Check whether there are outliers or fake clusters. If there are outliers or fake clusters, remove them.
Step 2. Calculate scaled distances for all pairs of remaining nodes.
Step 3. Find the shortest paths between all pairs of remaining nodes using the scaled distance.
Step 4. With the shortest paths, compute a density-based geodesic distance for all pairs of remaining nodes.
Step 5. Calculate the silhouette index using the density-based geodesic distances.

Fig. 7. Procedure for calculating the density-based silhouette index.
A detailed description of the assignment process is given in pseudo code in Appendix B.

Remarks. We adopted a mutual k-nearest neighbor graph construction scheme in which some nodes may not have edges connected to other nodes [52]. These nodes do not belong to any defined clusters, and thus we label them as "outliers." Furthermore, very small clusters, which we call "fake clusters," are ignored as well [23].
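A minimal sketch of the assignment step of Eq. (10) follows. It assumes a NumPy weight matrix W over all nodes and an integer label vector that holds cluster indices for already-assigned nodes and -1 for unassigned surrounding nodes; the function name is illustrative, not the authors'.

import numpy as np

def assign_surrounding_nodes(W, labels):
    """Iteratively assign unlabeled nodes to the cluster they are most strongly connected to (Eq. (10))."""
    labels = labels.copy()
    changed = True
    while changed:
        changed = False
        for i in np.flatnonzero(labels == -1):
            nbr = np.flatnonzero(W[i] > 0)
            nbr = nbr[labels[nbr] >= 0]        # neighbors that already carry a cluster label
            if nbr.size == 0:
                continue                       # not yet a neighbor surrounding node
            votes = {}
            for j in nbr:                      # weighted majority voting over clusters
                votes[labels[j]] = votes.get(labels[j], 0.0) + W[i, j]
            labels[i] = max(votes, key=votes.get)
            changed = True
    return labels                              # nodes still labeled -1 are reported as outliers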
2.5. Iterative examination and reassignment of cluster label

After the assignment step, each surrounding node is examined to ascertain whether it has been allocated properly. If a node does not belong to the cluster to which it has the largest weighted connection, it is considered incorrectly assigned. There can be several incorrectly assigned nodes because the assignment step does not consider connections between surrounding nodes.
Fig. 8. The simulated datasets: (a) Scenario 1, (b) Scenario 2, (c) Scenario 3, and (d) Scenario 4.
These nodes should be reallocated to the correct clusters for more complete clustering. This procedure, called the rearranging procedure, is illustrated in Fig. 5. In Fig. 5a, surrounding nodes 1 and 2 (dashed circles) are connected to Cluster 1 with weighted sums of 0.8 and 0.9, respectively, whereas surrounding node 3 (dashed circle) is connected to Cluster 2 with a weighted sum of 0.3. Therefore, surrounding nodes 1 and 2 are assigned to Cluster 1, and surrounding node 3 is assigned to Cluster 2, as shown in Fig. 5b. However, surrounding node 3 can be detected as an incorrectly assigned node if there are strong connections that link surrounding node 3 to surrounding nodes 1 and 2. These connections between surrounding nodes are expressed in Fig. 5c by a dashed bold line. In this example, surrounding node 3 is connected to surrounding nodes 1 and 2 with a weighted sum of 0.9, whereas it is connected to Cluster 2 with a weighted sum of 0.3. This indicates that node 3 is more closely connected to Cluster 1 than to Cluster 2. Thus, for a more appropriate assignment, surrounding node 3 should be reassigned from Cluster 2, to which it was initially assigned, to Cluster 1 (Fig. 5d). This rearranging procedure can improve the effectiveness of partitioning. To evaluate the effectiveness of the partitioning result, the graph cut between clusters is computed. The graph cut can be expressed as the sum of the costs of all nodes. The cut and the cost of node i, γ_i, are defined in Eqs. (11) and (12):

\mathrm{Cut} = \sum_{i=1}^{n} \gamma_i.    (11)

\gamma_i = \sum_{y_i \neq y_j} w_{ij}.    (12)
In Eq. (12), y_i denotes the cluster label assigned to node i. As shown in this equation, the cost of a node is the sum of all of its weights that are linked to other clusters. The cut, which is the sum of the costs of all nodes, represents the interconnectivity between clusters. Hence, the cut should be minimized to ensure that memberships in the data are appropriately identified. The graph cut can be improved by rearranging the incorrectly assigned nodes. For example, the graph cut is 0.9 in Fig. 5c. However, the cut is improved to 0.3 in Fig. 5d by reassigning node 3 to Cluster 1. In the final step, all nodes undergo the examination and rearranging procedures. To guarantee that the cut is minimal, all nodes should be checked to confirm that their cluster labels are appropriately assigned. In this step, any incorrectly assigned nodes, including core nodes, are relabeled to minimize the cut
if these nodes exist. This reassignment procedure is repeated until the cut is no longer improved. In Appendix C, a pseudo code is provided for detailed information of this step.
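The examination and reassignment step of Eqs. (11) and (12) can be sketched as follows. This is an illustrative Python fragment, reusing the weight matrix W and the NumPy label vector from the previous steps and assuming that every node with edges already carries a cluster label; it is not the authors' implementation.

import numpy as np

def graph_cut(W, labels):
    """Eqs. (11)-(12): total weight of edges whose endpoints carry different cluster labels."""
    diff = labels[:, None] != labels[None, :]
    return (W * diff).sum() / 2               # each edge counted once

def reassign_until_stable(W, labels):
    """Relabel nodes to their maximally connected cluster until the cut no longer improves."""
    labels = labels.copy()
    best = graph_cut(W, labels)
    improved = True
    while improved:
        improved = False
        for i in range(len(labels)):
            nbr = np.flatnonzero(W[i] > 0)
            if nbr.size == 0:
                continue
            # weighted majority vote over the clusters of node i's neighbors
            cand = {c: W[i, nbr[labels[nbr] == c]].sum() for c in np.unique(labels[nbr])}
            target = max(cand, key=cand.get)
            if target != labels[i]:
                labels[i] = target
        cut = graph_cut(W, labels)
        if cut < best:
            best, improved = cut, True
    return labels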
3. Simulation study

3.1. Parameter setting

The proposed algorithm contains two parameters, k and α. Parameter k is used to construct the mutual k-nearest neighbor graph, and α is used to determine the threshold that distinguishes between core and surrounding nodes. Fig. 6 uses a simulated dataset to show the effects of varying the parameters of the proposed algorithm. It can be observed, reading from the left to the right columns, that given the same k, parameter α controls both the degree of distinction between clusters and their overall number. A larger value of α tends to produce more clusters. Reading Fig. 6 from top to bottom shows the effect of varying k given the same α: a larger value of k decreases the degree of distinction between clusters, and several inhomogeneous clusters are merged. In graph-based clustering, it has been empirically shown that the optimal parameter k is a function of the number of observations [13,23]. Maier et al. [45] calculated bounds on the optimal parameter k for graph-based clustering. However, they examined the effect of parameter k on a specific clustering algorithm, namely Shi and Malik's [18] algorithm. To the best of our knowledge, no general guidelines exist for selecting parameter k. Hence, in the current study, we tried to find a suitable parameter k through a heuristic approach. More precisely, we defined k as in Eq. (13) and varied the values of ρ as 0.01, 0.02, 0.04, 0.06, and 0.08 in this simulation study:

Table 1. Parameter selection using the density-based silhouette index.

Scenario    | Normalized spectral clustering | Modularity-based clustering | DENGP (Proposed)
Scenario 1  | k = 120, K = 3                 | k = 120                     | k = 120, α = 0.4
Scenario 2  | k = 120, K = 2                 | k = 80                      | k = 120, α = 0.6
Scenario 3  | k = 240, K = 3                 | k = 240                     | k = 240, α = 0.6
Scenario 4  | k = 35, K = 2                  | k = 35                      | k = 35, α = 0.2
k = \lceil \rho N \rceil,    (13)
where N is the number of observations. In addition, we varied the values of α from 0.2 to 0.8 with a step size of 0.2. We then selected the parameter set (k, α) that produced the best clustering performance. To evaluate the selected set of parameters, we propose a modified version of the silhouette index based on the geodesic distance.
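In code, this parameter search amounts to a small grid search, as in the sketch below. Here run_dengp and dbsi are placeholder names for the proposed algorithm and for the density-based silhouette index defined later in this section; neither is a function from the paper.

import math

rho_grid = [0.01, 0.02, 0.04, 0.06, 0.08]
alpha_grid = [0.2, 0.4, 0.6, 0.8]

def select_parameters(X, run_dengp, dbsi):
    """Pick (k, alpha) maximizing the density-based silhouette index; k = ceil(rho * N) as in Eq. (13)."""
    N = len(X)
    best, best_score = None, -float("inf")
    for rho in rho_grid:
        k = math.ceil(rho * N)
        for alpha in alpha_grid:
            labels = run_dengp(X, k=k, alpha=alpha)
            score = dbsi(X, labels, k=k)
            if score > best_score:
                best, best_score = (k, alpha), score
    return best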
Fig. 9. Simulation results of Scenario 1: (a) Normalized spectral clustering (k = 120, K = 3), (b) Modularity-based clustering (k = 120), and (c) DENGP (k = 120, α = 0.4).
Fig. 10. Simulation results of Scenario 2: (a) Normalized spectral clustering (k = 120, K = 2), (b) Modularity-based clustering (k = 120), and (c) DENGP (k = 120, α = 0.6).
The silhouette index based on geodesic distance is one of the most popular measures for evaluating the performance of graph-based clustering algorithms [53]. However, this approach is not appropriate for the evaluation of clustering results in noisy cases because the existing measures of geodesic distance may not reflect the nonlinear relationships in a noisy dataset [54]. Hence, we propose a noise-resistant silhouette index and define the measure as a density-based silhouette index. Fig. 7 demonstrates the procedure for computing the density-based silhouette index. As presented in Fig. 7, the outliers and fake clusters are removed before the density-based silhouette index is computed. Having eliminated these nodes, the scaled distances between all nodes are calculated by the following equation:

\delta_{ij} = \begin{cases} \dfrac{1 - w_{ij}}{d_i d_j}, & \text{if } w_{ij} > 0 \\ 1, & \text{otherwise,} \end{cases}    (14)
where wij is the weight of the edge between node i and j, and di and dj are the density coefficients of nodes i and j, respectively.
By dividing the distance by the density coefficients, the connection between two dense nodes becomes relatively close, and vice versa for the connection between two sparse nodes. The scaled distance can be used to calculate all pairs of shortest paths and geodesic distances. The geodesic measure returned through the use of this scaled distance serves as the density-based geodesic distance. This density-based geodesic distance becomes more robust to noise by reflecting the density of each node. Finally, the silhouette index based on the density-based geodesic distance is computed by the following equation:

S = \frac{1}{N} \sum_{i=1}^{N} s(i),    (15)
where s(i) = (b_i − a_i) / max(a_i, b_i). Here, a_i denotes the average density-based geodesic distance from observation i to all other points in its own cluster, and b_i represents the average density-based geodesic distance from observation i to the nodes belonging to the cluster closest to node i. In Eq. (15), the silhouette index represents an average value of the individual silhouettes.
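A sketch of the procedure in Fig. 7 is shown below. It assumes NumPy and SciPy, that outliers and fake clusters have already been removed, that d is the vector of density coefficients, and that labels is a NumPy integer array; the function name is ours.

import numpy as np
from scipy.sparse.csgraph import shortest_path

def density_based_silhouette(W, d, labels):
    """Eqs. (14)-(15): silhouette index computed on density-based geodesic distances."""
    delta = np.where(W > 0, (1.0 - W) / np.outer(d, d), 1.0)   # Eq. (14): scaled distances
    np.fill_diagonal(delta, 0.0)
    G = shortest_path(delta, directed=False)                   # density-based geodesic distances
    n = len(labels)
    s = np.zeros(n)
    for i in range(n):
        same = (labels == labels[i]) & (np.arange(n) != i)
        a = G[i, same].mean() if same.any() else 0.0
        b = min(G[i, labels == c].mean() for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()                                            # Eq. (15)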
Fig. 11. Simulation results of Scenario 3: (a) Normalized spectral clustering (k = 240, K = 3), (b) Modularity-based clustering (k = 240), and (c) DENGP (k = 240, α = 0.6).
We define this measure as the density-based silhouette index and use it to select the appropriate set of parameters (k, α). To determine the set of parameters to use for comparison in the normalized spectral clustering algorithm, the density-based silhouette index is used to select both the graph construction parameter k and the number of clusters K. For comparisons with modularity-based clustering approaches, we selected the graph construction parameter k in the same way, using the density-based silhouette index.

3.2. Simulation setup

We conducted a simulation study to examine the properties of the proposed algorithm and compare it with the following algorithms: Ng's algorithm (normalized spectral clustering; [26]) and the Fast Newman algorithm (modularity-based clustering; [11]). The simulation used four scenarios. To facilitate visualization of the results, the experiment used two-dimensional data. Fig. 8a shows the data generated from three bivariate normal distributions with different mean values (but the same variance). Note that the clusters contain several noisy observations. Each
cluster consists of 1000 observations. Fig. 8b shows two banana-shaped clusters, each containing 1000 observations. In this case, these two clusters are also somewhat noisy. Fig. 8c illustrates the data generated from three bivariate normal distributions with different mean values and sizes (but the same variance). The largest cluster contains 2400 observations; the other two clusters contain 300 observations each. Fig. 8d shows two ring-shaped clusters and one Swiss roll-shaped cluster that exhibit nonlinear patterns [15]. The Swiss roll, inner ring, and outer ring clusters contain 1000, 1200, and 1300 observations, respectively. These simulated data were generated with MATLAB and the igraph package in R. Table 1 shows the results of parameter selection in each scenario.

3.3. Simulation results

Fig. 9 shows the results under Scenario 1 from normalized spectral clustering, greedy modularity-based clustering, and the proposed clustering algorithm. In this scenario, all three algorithms correctly found the three clusters. This implies that all three methods can properly accommodate a centralized graph that exhibits a dense pattern around the mean value.
Fig. 12. Simulation results of Scenario 4: (a) Normalized spectral clustering (k = 35, K = 2), (b) Modularity-based clustering (k = 35), and (c) DENGP (k = 35, α = 0.2).
Table 2. Comparison of clustering effectiveness in terms of Rand index on the simulation datasets.

Scenario    | Number of observations | Normalized spectral clustering | Modularity-based clustering | DENGP (Proposed)
Scenario 1  | 3000                   | 0.97                           | 0.95                        | 0.97
Scenario 2  | 2000                   | 0.75                           | 0.77                        | 0.98
Scenario 3  | 3000                   | 0.97                           | 0.69                        | 0.92
Scenario 4  | 3500                   | 0.74                           | 0.80                        | 0.99
Fig. 10 shows the results from the three clustering algorithms under Scenario 2. It can be seen that the proposed DENGP algorithm successfully found the two nonlinear groups. The proposed algorithm maximizes the intraconnectivity within a cluster and thus appropriately discovers the nonlinear patterns. On the other hand, the normalized spectral clustering algorithm failed to identify the banana-shaped clusters because of the noise around the clusters. As mentioned earlier, noise must be carefully removed as a prerequisite to successful application of the normalized spectral clustering
Table 3. Summary of datasets.

Dataset                  | Number of variables | Number of observations | Number of classes
Iris                     | 4                   | 150                    | 3
WDBC-I                   | 30                  | 569                    | 2
WDBC-II                  | 9                   | 683                    | 2
Parkinson                | 19                  | 195                    | 2
Banknote Authentication  | 4                   | 1372                   | 2
Indian Liver Patient     | 10                  | 583                    | 2
Congressional Voting     | 16                  | 232                    | 2
Lymphoma                 | 4027                | 62                     | 2
Auto MPG                 | 8                   | 392                    | Unlabeled
Cloud                    | 10                  | 1024                   | Unlabeled
technique. Furthermore, modularity-based clustering also failed to identify the clusters properly. A modularity-based clustering algorithm is ineffective when an inferred graph lacks a centralized structure. Fig. 11 shows that both the proposed DENGP algorithm and the normalized spectral clustering algorithm correctly found the three
clusters. However, the modularity-based clustering algorithm divided the largest cluster into two. As noted earlier, the performance of modularity-based clustering algorithms deteriorates when used with clusters of different sizes. Fig. 12 displays the results from the three clustering algorithms under Scenario 4, which reflects nonlinearity and locally patterned data. The proposed DENGP algorithm successfully identified the three nonlinear groups. Because the proposed algorithm maximizes the internal connectivity of clusters by considering density,
it successfully discovered these noisy and nonlinear patterns. On the other hand, the normalized spectral clustering algorithm failed to distinguish between the two ring-shaped clusters because of its vulnerability to noise around nonlinear clusters. Moreover, the modularity-based clustering algorithm's results were unsatisfactory for the same reason as in Scenario 2. In addition to visual comparison, we used Rand indices to compare the performance of the three clustering algorithms considered in this study.
Table 4. Parameter selection based on the density-based silhouette index.

Dataset                  | Normalized spectral clustering | Modularity-based clustering | DENGP (Proposed)
Iris                     | k = 20, K = 2                  | k = 26                      | k = 20, α = 0.2
WDBC-I                   | k = 114, K = 2                 | k = 114                     | k = 103, α = 0.7
WDBC-II                  | k = 130, K = 2                 | k = 137                     | k = 130, α = 0.3
Parkinson                | k = 20, K = 2                  | k = 33                      | k = 20, α = 0.3
Banknote Authentication  | k = 42, K = 2                  | k = 42                      | k = 42, α = 0.5
Indian Liver Patient     | k = 70, K = 7                  | k = 52                      | k = 70, α = 0.5
Congressional Voting     | k = 35, K = 2                  | k = 46                      | k = 24, α = 0.5
Lymphoma                 | k = 8, K = 3                   | k = 20                      | k = 8, α = 0.5
Auto MPG                 | k = 36, K = 5                  | k = 32                      | k = 24, α = 0.4
Cloud                    | k = 82, K = 2                  | k = 82                      | k = 41, α = 0.6
Table 5. Comparison of clustering effectiveness in terms of Rand index on eight real case datasets (class-labeled cases).

Dataset                  | Normalized spectral clustering | Modularity-based clustering | DENGP (Proposed)
Iris                     | 0.75                           | 0.82                        | 0.76
WDBC-I                   | 0.85                           | 0.71                        | 0.85
WDBC-II                  | 0.94                           | 0.74                        | 0.93
Parkinson                | 0.56                           | 0.51                        | 0.57
Banknote Authentication  | 0.57                           | 0.67                        | 0.76
Indian Liver Patient     | 0.45                           | 0.50                        | 0.54
Congressional Voting     | 0.76                           | 0.78                        | 0.82
Lymphoma                 | 0.76                           | 0.54                        | 0.76
Fig. 13. 3-D PCA score plots of the Auto MPG data: (a) No clustering, (b) Normalized spectral clustering (k = 36, K = 5), (c) Modularity-based clustering (k = 32), and (d) DENGP (k = 24, α = 0.4).
Fig. 14. 3-D PCA score plots of the Cloud data: (a) No clustering, (b) Normalized spectral clustering (k = 82, K = 2), (c) Modularity-based clustering (k = 82), and (d) DENGP (k = 41, α = 0.6).
The Rand index evaluates the agreement between the clustering results and the true class labels, which are known as the ground truth [12]. The Rand index between ground truth π and clustering result C, RI(π, C), can be computed by

\mathrm{RI}(\pi, C) = \frac{n_{00} + n_{11}}{n_{00} + n_{01} + n_{10} + n_{11}} = \frac{2(n_{00} + n_{11})}{n(n-1)},    (16)
where n00 is the number of pairs of nodes having the same cluster label and the same class label, n01 is the number of pairs of nodes having the same cluster label and different class labels, n10 is the number of pairs of nodes having different cluster labels and the same class label, and n11 is the number of pairs of nodes having different cluster labels and different class labels. A high Rand index implies that a clustering algorithm correctly recovers the original classes. The value of the Rand index ranges between 0 and 1: the closer the value is to one, the more accurately the clustering algorithm recovers the clusters, and the opposite holds as it approaches zero.
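Eq. (16) translates directly into a few lines of Python. The sketch below is illustrative; labels_true and labels_pred are assumed to be integer sequences over the same n observations.

from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Eq. (16): fraction of pairs on which the clustering and the ground truth agree."""
    agree = 0
    pairs = list(combinations(range(len(labels_true)), 2))
    for i, j in pairs:
        same_class = labels_true[i] == labels_true[j]
        same_cluster = labels_pred[i] == labels_pred[j]
        if same_class == same_cluster:   # counts n00 + n11
            agree += 1
    return agree / len(pairs)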
Table 2 shows the Rand indices for the three clustering algorithms under the four simulation scenarios. The results indicate that the proposed DENGP algorithm yields large Rand index values, demonstrating its superiority over the other algorithms.

4. Case study

We also conducted experiments on 10 benchmark datasets obtained from the University of California Irvine Machine Learning Repository (http://archive.ics.uci.edu) and an open source repository (http://www.ligarto.org/rdiaz/index.html). Table 3 provides a summary of these datasets. Of these 10, the Congressional Voting case differs from the others because it is composed of 16 binary variables. Consequently, we used the Jaccard metric to compute the distances between the observations when constructing the graph. We used Euclidean distances for the other datasets. As with the simulation study, appropriate parameters were selected by searching through potential values in a certain range and finding the set that maximized the density-based silhouette index. Table 4 provides the set of best parameters for the three clustering algorithms for each dataset. For the eight class-labeled datasets, we compared the performances of the three clustering
algorithms in terms of Rand index. Table 5 shows the Rand indices of the three clustering algorithms. The comparison results clearly demonstrate that the proposed DENGP algorithm generally performs better than the other algorithms: for six datasets out of eight, it produced Rand indices equal to or larger than those of the other two algorithms. This shows that the proposed algorithm discovers the natural groupings more appropriately than the existing algorithms. The proposed DENGP was especially superior with the Banknote Authentication data. The intrinsic clusters in the Banknote Authentication data display clear nonlinear patterns. This result again demonstrates the superiority of the proposed algorithm with nonlinear data. For the unlabeled datasets (Auto MPG and Cloud data), we compared the clustering results with a visualization technique. To visualize the clustering results, we represented them using three-dimensional principal component analysis (PCA) score plots. Figs. 13 and 14 display the clustering results of the Auto MPG and Cloud data, respectively. Fig. 13 shows the data structure of the Auto MPG data and the clustering results from the three algorithms. From Fig. 13a, we observed five clusters and some potential outliers and noise around the clusters. In this case, the normalized spectral clustering algorithm
failed to identify the appropriate clusters because of the outlying patterns. The modularity-based clustering algorithm found the five intrinsic clusters, but it also produced too many clusters, and some of them seem meaningless (e.g., Clusters 6, 7, 8, and 9). On the contrary, the proposed DENGP algorithm successfully discovered the five intrinsic clusters and some potential outliers. Fig. 14 shows the data structure of the Cloud data and the clustering results from the three algorithms. We can observe from Fig. 14a that there are two nonlinear groups and several noisy observations around the clusters, which is similar to Scenario 2 in the simulated data. As shown in Fig. 14b and c, the normalized spectral clustering and modularity-based clustering algorithms failed to identify the clusters properly because of the nonlinearity and noise. The proposed DENGP algorithm successfully found the two nonlinear clusters (Fig. 14d). It should be noted that the proposed algorithm performed worse than the modularity-based clustering algorithm on the Iris data. Fig. 15 shows our effort to find the reason for this; in it, a principal component analysis score plot and a mutual k-nearest neighbor graph structure represent the Iris data. As shown in Fig. 15a, one of the three classes (Setosa) is separated from the other two. Reflecting this distribution, the optimal graph structure is constructed with no connections between Setosa and the other two classes (Versicolor and Virginica), as shown in Fig. 15b. With this graph structure, partitioning into two groups maximized the density-based silhouette index. The normalized spectral clustering algorithm also erroneously determined two groups because it similarly failed to distinguish between Versicolor and Virginica when using the parameters chosen to maximize the density-based silhouette index.
5. Computational complexity analysis

In the current study, we analyzed the computational complexity of the proposed DENGP algorithm in terms of execution times. We conducted the experiments on an Intel Core i7-4790 CPU @ 3.6 GHz computer with 16 GB of memory. Table 6 shows the execution times of the three clustering algorithms on the 10 benchmark datasets and the four simulation scenarios. Table 6 indicates that the DENGP algorithm had slightly higher computational complexity than the other clustering algorithms. This is because the DENGP algorithm adopts bootstrapping, known to be a computationally expensive technique, to estimate the threshold for classifying the core and surrounding nodes. However, this may
Fig. 15. Graphical representation of the Iris case with (a) two-dimensional principal component analysis plot, and (b) mutual k-nearest neighbor graph structure (k is 20).

Table 6. Comparison of computational complexity in terms of execution time (s) on 14 datasets.

Dataset                  | Normalized spectral clustering | Modularity-based clustering | DENGP (Proposed)
Iris                     | 0.08                           | 0.04                        | 0.26
WDBC-I                   | 0.21                           | 0.22                        | 0.94
WDBC-II                  | 0.27                           | 0.30                        | 1.09
Parkinson                | 0.09                           | 0.05                        | 0.29
Banknote Authentication  | 1.16                           | 0.71                        | 2.05
Indian Liver Patient     | 0.62                           | 0.18                        | 0.92
Congressional Voting     | 0.09                           | 0.06                        | 0.34
Lymphoma                 | 0.08                           | 0.05                        | 0.12
Auto MPG                 | 0.27                           | 0.08                        | 0.57
Cloud                    | 0.62                           | 0.50                        | 1.76
Scenario 1               | 4.00                           | 3.87                        | 8.33
Scenario 2               | 2.31                           | 1.62                        | 4.32
Scenario 3               | 3.81                           | 6.24                        | 9.04
Scenario 4               | 11.20                          | 4.00                        | 12.22
not be a serious problem with modern computing power. We also found several efficient threshold estimation methods [55,56] that may improve the DENGP algorithm computationally; exploring them is an interesting topic for future study.
6. Conclusions

In the present study, we proposed a new clustering approach, termed density-based noisy graph partitioning. The proposed idea can compensate for the limitations of currently used graph-based clustering algorithms because it successfully discovers noisy and nonlinear patterns in the dataset without the need to specify the number of clusters in advance. In addition, we proposed a density-based silhouette index for evaluating a set of parameters (k and α). Experiments on simulated and real case data demonstrated the accuracy and robustness of the proposed algorithm. The experimental results show that the proposed DENGP algorithm outperformed other graph-based clustering algorithms, especially when the data exhibited nonlinear patterns with noise. For further study, we will apply the DENGP algorithm to large datasets, such as social network data and text data. To do this, we will conduct a study to improve the computational efficiency of the proposed algorithm. In addition, we will conduct a study to optimize the parameters in the proposed algorithm.
Acknowledgements We thank the editor and referees for their constructive comments and suggestions, which greatly improved the quality of the paper. This research was supported by Brain Korea PLUS and Basic Science Research Program through the National Research Foundation of Korea funded by the Ministry of Science, ICT and Future Planning (2013007724).
Appendix A. Pseudo code for the maximizing intraconnectivity algorithm

Fig. A.1 summarizes the maximizing intraconnectivity algorithm explained in Section 2.3. As noted in that section, the maximizing intraconnectivity algorithm can be divided into two steps, initialization and merging. First, the node with the largest density coefficient is selected with the SEARCH_DENSEST_NODE function. The selected node and its neighborhood form a cluster with the UPDATE function. The UPDATE function also stores the density coefficient of the selected node as the density of the cluster. Next, the connectivity between the clusters is calculated through the CALCULATE_CONNECTIVITY function. In the merging step, the densest cluster is first selected through the SEARCH_DENSEST_CLUSTER function. The clusters are merged with the Cluster.MERGE function.
Maximizing Intraconnectivity Algorithm
Construct_Subcluster_Maximizing_Connectivity(Core_Nodes)
  # Initialization step
  Cluster := NULL;
  WHILE Core_Nodes ~= Empty
    Seed <- SEARCH_DENSEST_NODE(Core_Nodes);
    Neighboring_Core <- Seed.Neighborhood;
    Seed_List <- COMPOSE_SEED_LIST(Seed, Neighboring_Core);
    Initial_Cluster <- Cluster.UPDATE(Seed_List);
    Core_Nodes.DELETE(Seed_List);
  END WHILE
  Cluster <- Initial_Cluster;
  Connectivity_Cluster <- CALCULATE_CONNECTIVITY(Cluster);

  # Merging step
  WHILE Connectivity_Cluster.SUM() ~= 0
    Cluster_Set <- Cluster;
    WHILE Cluster_Set ~= Empty
      Seed <- SEARCH_DENSEST_CLUSTER(Cluster_Set);
      Connected_Clusters <- Seed.Connected_Cluster;
      Seed_List <- COMPOSE_SEED_LIST(Seed, Connected_Clusters);
      Cluster <- Cluster.MERGE(Seed_List);
      Connectivity_Cluster <- Connectivity_Cluster.UPDATE(Cluster);
      Cluster_Set.DELETE(Seed_List);
    END WHILE
    Connectivity_Cluster <- CALCULATE_CONNECTIVITY(Cluster);
  END WHILE
  Return(Cluster);
END

Fig. A.1. Pseudo code for the maximizing intraconnectivity algorithm.
After the agglomeration of clusters, the connectivities between the agglomerated clusters are computed with the CALCULATE_CONNECTIVITY function. This merging procedure is repeated until the clusters are no longer connected.
Appendix B. Pseudo code for assignment of surrounding nodes

Fig. B.1 shows the assignment step explained in Section 2.4 in a pseudo code format. As described in that section, neighbor surrounding nodes are searched for with the Graph.SEARCH function. Next, cluster labels are allocated to the neighbor surrounding nodes based on a weighted majority voting scheme. The weighted majority voting scheme is implemented by the Seed.WEIGHTED_MAJORITY_VOTING function.

Appendix C. Pseudo code for iterative examination and reassignment of cluster label

Fig. C.1 describes the iterative examination and reassignment step explained in Section 2.5 in a pseudo code format. First, the algorithm checks whether there are incorrectly assigned nodes. This examination procedure is implemented through the Seed.CHECK_LABEL function. If a node does not belong to the cluster to which it has the largest weighted connection, the counter of incorrectly assigned nodes (Number_Incorrectly_Assigned_Node in the pseudo code) is increased.
Assignment of Surrounding Nodes
Assigning_Surrounding_Nodes(Graph, Cluster, Surrounding_Node)
  WHILE Cluster.Size does not change
    Surrounding_Connect_to_Cluster <- Graph.SEARCH(Cluster, Surrounding_Node);
    FOR i FROM 1 TO Surrounding_Connect_to_Cluster.Size
      Seed <- Surrounding_Connect_to_Cluster(i);
      Neighboring_Core <- Seed.Neighborhood(IF Seed.Neighborhood == Core_Node);
      Seed.Label <- Seed.WEIGHTED_MAJORITY_VOTING(Graph, Neighboring_Core);
      Cluster <- Cluster.UPDATE(Seed, Seed.Label);
    END FOR
  END WHILE
  Return(Cluster);
END

Fig. B.1. Pseudo code for the assignment step.
Iterative Examination and Reassignment of Cluster Label
Iterative_Examination_Reassignment(Graph, Cluster)
  // Examine whether there are incorrectly assigned nodes //
  Number_Incorrectly_Assigned_Node <- 0;
  FOR Seed FROM 1 TO Data_Size
    IF Seed.CHECK_LABEL ~= TRUE THEN
      Number_Incorrectly_Assigned_Node <- Number_Incorrectly_Assigned_Node + 1;
    END IF
  END FOR
  // If there are incorrectly assigned nodes, reassign them and re-examine the assignments //
  // This procedure is repeated until the graph cut no longer improves //
  IF Number_Incorrectly_Assigned_Node ~= 0
    Graph_Cut <- Cluster.CALCULATE_CUT(Graph);
    WHILE Graph_Cut improves
      FOR Seed FROM 1 TO Data_Size
        IF Seed.CHECK_LABEL ~= TRUE THEN
          Seed.Label <- Seed.WEIGHTED_MAJORITY_VOTING(Graph, Seed.Neighborhood);
          Cluster <- Cluster.UPDATE(Seed, Seed.Label);
        END IF
      END FOR
      Graph_Cut <- Cluster.CALCULATE_CUT(Graph);
    END WHILE
  END IF
  Return(Cluster);
END

Fig. C.1. Pseudo code for the iterative examination and reassignment step.
If all nodes are correctly assigned, this procedure is terminated. If not, the incorrectly assigned nodes should be reassigned. Prior to reassignment, the graph cuts between clusters are computed by the Cluster.CALCULATE_CUT function. Next, all nodes are examined to determine whether their cluster labels have been allocated correctly. If there are incorrectly assigned nodes, these nodes are reassigned with the Seed.WEIGHTED_MAJORITY_VOTING function.
References
[1] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, 2nd ed., Wiley, NY, 2001.
[2] J. Hu, L. Fang, Y. Cao, H.-J. Zeng, H. Li, Q. Yang, Z. Chen, Enhancing text clustering by leveraging Wikipedia semantics, In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2008, pp. 179–186.
[3] P. Felzenszwalb, D. Huttenlocher, Efficient graph-based image segmentation, Int. J. Comput. Vis. 59 (2) (2004) 167–181.
[4] A.L. Tarca, V.J. Carey, X.W. Chen, R. Romero, S. Draghici, Machine learning and its applications to biology, PLoS Comput. Biol. 3 (6) (2007) 953–963.
[5] G. Kou, C. Lou, Multiple factor hierarchical clustering algorithm for large scale web page and search engine clickstream data, Ann. Oper. Res. 197 (1) (2012) 123–134.
[6] J.H. Kang, S.B. Kim, A clustering algorithm-based control chart for inhomogeneously distributed TFT-LCD processes, Int. J. Prod. Res. 51 (18) (2013) 5644–5657.
[7] A.K. Jain, Data clustering: 50 years beyond K-means, Pattern Recognit. Lett. 31 (8) (2010) 651–666.
[8] J. MacQueen, Some methods for classification and analysis of multivariate observations, In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281–297.
[9] M. Ester, H.P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, In: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, 1996, pp. 226–231.
[10] M.E.J. Newman, M. Girvan, Finding and evaluating community structure in networks, Phys. Rev. E 69 (2) (2004) 026113.
[11] M.E.J. Newman, Fast algorithm for detecting community structure in networks, Phys. Rev. E 69 (6) (2004) 066133.
[12] P.N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, Addison-Wesley, Boston, 2006.
[13] M. Maier, M. Hein, U. von Luxburg, Cluster identification in nearest-neighbor graphs, In: Proceedings of the 18th International Conference on Algorithmic Learning Theory, 2007, pp. 196–210.
[14] M. Aupetit, T. Catz, High-dimensional labeled data analysis with topology representing graphs, Neurocomputing 63 (2005) 139–169.
[15] J.B. Tenenbaum, V. de Silva, J.C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (5500) (2000) 2319–2323.
[16] K. Ozaki, M. Shimbo, M. Komachi, Y. Matsumoto, Using the mutual k-nearest neighbor graphs for semi-supervised classification of natural language data, In: Proceedings of the 15th Conference on Computational Natural Language Learning, Association for Computational Linguistics, 2011, pp. 154–162.
[17] K.H. Anders, M. Sester, Parameter-free cluster detection in spatial databases and its application to typification, Int. Arch. Photogramm. Remote Sens. 33 (Part B4/1) (2000) 75–83.
[18] J. Shi, J. Malik, Normalized cuts and image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 22 (8) (2000) 888–905.
[19] V. Estivill-Castro, I. Lee, A.T. Murray, Criteria on proximity graphs for boundary extraction and spatial clustering, In: Proceedings of the 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2001, pp. 348–358.
[20] P. Foggia, G. Percannella, C. Sansone, M. Vento, A graph-based clustering method and its applications, In: Proceedings of the 2nd International Conference on Advances in Brain, Vision, and Artificial Intelligence, 2007, pp. 277–287.
[21] S.E. Schaeffer, Graph clustering, Comput. Sci. Rev. 1 (1) (2007) 27–64.
[22] E. Hartuv, R. Shamir, A clustering algorithm based on graph connectivity, Inf. Process. Lett. 76 (4) (2000) 175–181.
[23] M. Maier, M. Hein, U. von Luxburg, Optimal construction of k-nearest-neighbor graphs for identifying noisy clusters, Theor. Comput. Sci. 410 (19) (2009) 1749–1764.
[24] F.R. Chung, Spectral Graph Theory, Vol. 92, American Mathematical Society, 1997.
[25] L. Hagen, A. Kahng, New spectral methods for ratio cut partitioning and clustering, IEEE Trans. Comput.-Aided Des. 11 (9) (1992) 1074–1085.
[26] A.Y. Ng, M.I. Jordan, Y. Weiss, On spectral clustering: analysis and an algorithm, In: Proceedings of the 14th Conference on Advances in Neural Information Processing Systems, 2001, pp. 849–856.
[27] R. Kannan, S. Vempala, A. Vetta, On clusterings: good, bad and spectral, J. ACM 51 (3) (2004) 497–515.
[28] S.B. Patkar, H. Narayanan, An efficient practical heuristic for good ratio-cut partitioning, In: Proceedings of the 16th International Conference on VLSI Design, IEEE Circuits and Systems Society, 2003, pp. 64–69.
[29] M. Filippone, F. Camastra, F. Masulli, S. Rovetta, A survey of kernel and spectral methods for clustering, Pattern Recognit. 41 (1) (2008) 176–190.
[30] U. von Luxburg, A tutorial on spectral clustering, Stat. Comput. 17 (4) (2007) 395–416.
[31] D. Verma, M. Meila, A Comparison of Spectral Clustering Algorithms, Technical Report, Department of Computer Science, University of Washington, 2003.
[32] R. Tibshirani, G. Walther, T. Hastie, Estimating the number of clusters in a data set via the gap statistic, J. R. Stat. Soc. Ser. B (Stat. Methodol.) 63 (2) (2001) 411–423.
[33] Z. Li, J. Liu, S. Chen, X. Tang, Noise robust spectral clustering, In: Proceedings of the 11th International Conference on Computer Vision, IEEE Computer Society, 2007, pp. 1–8.
[34] U. Brandes, D. Delling, M. Gaertler, R. Gorke, M. Hoefer, Z. Nikoloski, D. Wagner, On modularity clustering, IEEE Trans. Knowl. Data Eng. 20 (2) (2008) 172–188.
[35] S. Fortunato, Community detection in graphs, Phys. Rep. 486 (3) (2010) 75–174.
[36] A. Kehagias, L. Pitsoulis, Bad communities with high modularity, Eur. Phys. J. B 86 (7) (2013) 1–11.
[37] G. Karypis, E.H. Han, V. Kumar, Chameleon: hierarchical clustering using dynamic modeling, Computer 32 (8) (1999) 68–75.
[38] S. van Dongen, A Cluster Algorithm for Graphs, Technical Report 10, 2000.
[39] J. Li, K. Wang, L. Xu, Chameleon based on clustering feature tree and its application in customer segmentation, Ann. Oper. Res. 168 (1) (2009) 225–245.
[40] V. Satuluri, S. Parthasarathy, Scalable graph clustering using stochastic flows: applications to community discovery, In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009, pp. 737–746.
[41] A. Rinaldo, L. Wasserman, Generalized density clustering, Ann. Stat. 38 (2010) 2678–2722.
[42] I. Steinwart, Adaptive density level set clustering, In: Proceedings of the 24th Annual Conference on Learning Theory, 2011, pp. 703–738.
[43] L. Ertöz, M. Steinbach, V. Kumar, Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data, In: Proceedings of the SIAM International Conference on Data Mining, 2003, pp. 47–58.
[44] J. Shi, Z. Luo, A novel clustering approach based on the manifold structure of gene expression data, In: Proceedings of the 4th International Conference on Bioinformatics and Biomedical Engineering, 2010, pp. 1–4.
[45] M. Maier, U. von Luxburg, M. Hein, Influence of graph construction on graph-based clustering measures, In: Proceedings of the 21st Conference on Advances in Neural Information Processing Systems, 2008, pp. 196–201.
[46] L. Zelnik-Manor, P. Perona, Self-tuning spectral clustering, In: Proceedings of the 17th Conference on Advances in Neural Information Processing Systems, 2004, pp. 1601–1608.
[47] E. Bicici, D. Yuret, Locally scaled density based clustering, In: Proceedings of the 8th International Conference on Adaptive and Natural Computing Algorithms, 2007, pp. 739–748.
[48] J.A. Bondy, U.S.R. Murty, Graph Theory with Applications, 6th ed., Macmillan, London, 1976.
[49] L. Freeman, Centrality in social networks: conceptual clarification, Soc. Netw. 1 (3) (1979) 215–239.
[50] P. Bonacich, Power and centrality: a family of measures, Am. J. Sociol. 92 (5) (1987) 1170–1182.
[51] B. Efron, R.J. Tibshirani, An Introduction to the Bootstrap, Chapman & Hall, NY, 1994.
[52] V. Hautamäki, I. Karkkainen, P. Franti, Outlier detection using k-nearest neighbour graph, In: Proceedings of the 17th International Conference on Pattern Recognition, 2004, pp. 430–433.
[53] F. Boutin, M. Hascoet, Cluster validity indices for graph partitioning, In: Proceedings of the 8th International Conference on Information Visualisation, IEEE Computer Society, 2004, pp. 376–381.
[54] M. Balasubramanian, E.L. Schwartz, The Isomap algorithm and topological stability, Science 295 (2002) 7.
[55] Z. Qin, V. Petricek, N. Karampatziakis, L. Li, J. Langford, Efficient online bootstrapping for large scale learning, arXiv preprint arXiv:1312.5021, 2013.
[56] S. Basiri, E. Ollila, V. Koivunen, Robust, scalable and fast bootstrap method for analyzing large scale data, arXiv preprint arXiv:1504.02382, 2015.
Jae Hong Yu received the B.S. degree from the Department of Industrial Management Engineering at Korea University, Seoul, in 2012. Since March 2012, he has been working toward the Ph.D. degree in the Department of Industrial Management Engineering, Korea University, Seoul. His research interests include clustering algorithms for noisy data and feature selection algorithms for clustering high-dimensional data.
Seoung Bum Kim has been a professor in the Department of Industrial Management Engineering at Korea University since 2009. He received an M.S. degree in Industrial and Systems Engineering in 2001, an M.S. degree in Statistics in 2004, and a Ph.D. degree in Industrial and Systems Engineering in 2005, all from the Georgia Institute of Technology. From 2005 to 2009, Dr. Kim was an Assistant Professor in the Department of Industrial and Manufacturing Systems Engineering at the University of Texas at Arlington. His research interests include statistical and data mining modeling for various problems arising in engineering and science. He is actively involved with the Institute for Operations Research and the Management Sciences (INFORMS), serving as president of the INFORMS Section on Data Mining.