CHAPTER 6
Clustering

Clustering is frequently used in data analysis and machine learning, for example in pattern recognition. It is also closely related to network design and so plays a prominent role throughout this text. Clustering techniques are used to find clusters meeting some objectives, such as finding the largest or smallest cluster sizes, or dividing the data into a predefined number of clusters k in a way that is optimal in some sense. To determine the properties of the clustering itself (as opposed to the data), cluster quality measures are used.

We can use different techniques to group general data in an efficient way according to some given criteria. Such a grouping often reveals underlying structures and dependencies in the data that may not be immediately obvious. We will refer to a single observation as a data point, or more generally, a data object, that possesses various properties that can be defined independently of other data objects. Since the relationships between data points can be represented by a graph, many clustering techniques are closely related to graph theory.

More formally, in the minimum k-clustering problem a finite data set D is given together with a distance function d : D × D → N satisfying the triangle inequality [53]. The goal is to partition D into k clusters C1, C2, . . . , Ck, where Ci ∩ Cj = ∅ for i ≠ j, so that the maximum intracluster distance is minimized (that is, the maximum distance between two points assigned to the same cluster). This problem is approximable within a factor of 2, but not approximable within (2 − ε) for any ε > 0.

A related problem is the minimum k-center problem, where a complete graph is given with a distance function d : V × V → N and the goal is to construct a set of centers C ⊆ V of fixed order |C| = k such that the maximum distance from a vertex to the nearest center is minimized. Essentially, this is not a graph problem, as the data set is simply a set of data and their distances – the edges play no role here. If the distance function satisfies the triangle inequality, the minimum k-center problem can be approximated within a factor of 2, but it is not approximable within (2 − ε) for any ε > 0. Without the assumption that the distance satisfies the triangle inequality, the problem is harder. A capacitated version, where the triangle inequality does hold but the number of vertices "served" by a single center vertex is bounded from above by a constant, is approximable within a factor of 5. A center serves a vertex if it is the closest center to that vertex. Another capacitated version, where the maximum distance is bounded by a constant and the task is to
choose a minimum-order set of centers, is approximable within a factor of log(c) + 1, where c is the capacity of each center. The problem is also referred to as the facility location problem. A weighted version of the k-center problem, where the distance of a vertex to a center is multiplied by the weight of the vertex and the maximum of this product is to be minimized, is approximable within a factor of 2, but it cannot be approximated within (2 − ε) for any ε > 0. If it is not the maximum distance that is of interest, but the sum of the distances to the nearest center is minimized instead while keeping the order of the center set fixed, the problem is called the minimum k-median problem.

Unfortunately, no single definition of a cluster in graphs is universally accepted, and the variants used in the literature are numerous. In the setting of graphs, each cluster should intuitively be connected: there should be at least one, preferably several, paths connecting each pair of vertices within a cluster. If a vertex u cannot be reached from a vertex v, they should not be grouped in the same cluster. Furthermore, the paths should be internal to the cluster: in addition to the vertex set C being connected in G, the subgraph induced by C should be connected in itself, meaning that it is not sufficient for two vertices v and u in C to be connected by a path that passes through vertices in V \ C; they also need to be connected by a path that only visits vertices included in C. As a consequence, when clustering a disconnected graph with known components, the clustering should usually be conducted on each component separately, unless some global restriction on the resulting clusters is imposed. In some applications one may wish to obtain clusters of similar order and/or density, in which case the clusters computed in one component also influence the clustering of other components.

We classify the edges incident on v ∈ C into two groups: internal edges, which connect v to other vertices also in C, and external edges, which connect v to vertices that are not included in the cluster C. With Γ(v) denoting the neighborhood of v, we have

degint(v, C) = |Γ(v) ∩ C|,
degext(v, C) = |Γ(v) ∩ (V \ C)|,
deg(v) = degint(v, C) + degext(v, C).

Clearly, degext(v, C) = 0 implies that C containing v could be a good cluster, as v has no connections outside it. Similarly, if degint(v, C) = 0, v should not be included in C, as it is not connected to any of the other vertices included. It is generally agreed upon that a subset of vertices forms a good cluster if the induced subgraph is dense, but there are relatively few connections from the included vertices to vertices in the rest of the graph.
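As a small illustration, the internal and external degrees can be computed directly from an adjacency-set representation of the graph. The following Python sketch is not taken from the text; the representation and function name are illustrative.

```python
def internal_external_degree(adj, v, cluster):
    """Return (deg_int, deg_ext) of vertex v with respect to a cluster C.

    adj     -- dict mapping each vertex to the set of its neighbors
    v       -- the vertex under consideration (assumed to be in cluster)
    cluster -- set of vertices forming the cluster candidate C
    """
    neighbors = adj[v]
    deg_int = len(neighbors & cluster)   # edges staying inside C
    deg_ext = len(neighbors - cluster)   # edges leaving C
    return deg_int, deg_ext

# Example: a dense triangle {0, 1, 2} with a pendant vertex 3
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}
print(internal_external_degree(adj, 1, {0, 1, 2}))   # (2, 1)
```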
A measure that helps evaluate the sparsity of connections from the cluster to the rest of the graph is the cut size c(Ci, V \ Ci). The smaller the cut size, the better "isolated" the cluster. Determining when a cluster is dense is naturally done by computing the graph density. We refer to the density of the subgraph induced by the cluster as the internal or intracluster density, i.e.,

δint(Ci) = |{{u, v} ∈ E | u, v ∈ Ci}| / (|Ci|(|Ci| − 1)).

The intracluster density of a given clustering of a graph G into k clusters C1, C2, . . . , Ck is the average of the intracluster densities of the included clusters, i.e.,

δint(G | C1, . . . , Ck) = (1/k) Σ_{i=1}^{k} δint(Ci).
The external or intercluster density of a clustering is defined as the ratio of intercluster edges to the maximum number of intercluster edges possible, which is effectively the cut sizes of the clusters with edges having weight 1. We have

δext(G | C1, . . . , Ck) = |{{u, v} ∈ E | u ∈ Ci, v ∈ Cj, i ≠ j}| / (n(n − 1) − Σ_{l=1}^{k} |Cl|(|Cl| − 1)).

The internal density of a good clustering should be notably higher than the density of the graph δ(G), and the intercluster density of the clustering should be lower than the graph density. The loosest possible definition of a graph cluster is that of a connected component, and the strictest definition is that each cluster should be a maximal clique. There are two main approaches for identifying a good cluster: one may either compute some values for the vertices and then classify the vertices into clusters based on the values obtained, or compute a fitness measure over the set of possible clusters and then choose among the set of cluster candidates those that optimize the measure used. The measures described below belong to the second category.

Firstly, we need to define what we mean by similarity of data points in this context. Consider a number T of points distributed in the two-dimensional plane. In case the connection costs are assumed to be proportional to the length of the links connecting the terminals with a concentrator, the intuition behind clustering of these terminals is to group terminals that are geometrically close into the same cluster. It is convenient to form a similarity matrix for the terminals, whose entries are the reciprocals of the mutual distances between them. Since we do not allow self-loops in an undirected graph, the diagonal elements are set to zero. The choice of similarity is dependent both on the application and, to some degree, on the selected method.
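The intracluster and intercluster densities defined above translate directly into code. The sketch below is only illustrative; it assumes a simple undirected graph given as an edge list and a clustering given as a list of disjoint vertex sets.

```python
def cluster_densities(n, edges, clusters):
    """Return (list of delta_int(C_i), delta_ext) for a clustering."""
    label = {v: i for i, cl in enumerate(clusters) for v in cl}
    intra = [0] * len(clusters)
    inter = 0
    for u, v in edges:
        if label[u] == label[v]:
            intra[label[u]] += 1        # intracluster edge
        else:
            inter += 1                  # intercluster edge
    delta_int = [m_i / (len(cl) * (len(cl) - 1)) if len(cl) > 1 else 0.0
                 for m_i, cl in zip(intra, clusters)]
    max_inter = n * (n - 1) - sum(len(cl) * (len(cl) - 1) for cl in clusters)
    delta_ext = inter / max_inter if max_inter else 0.0
    return delta_int, delta_ext
```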
6.1 Applications of Clustering

Data analysis often studies information that can be categorized along different dimensions. Such dimensions may include independent variables, such as time, location, price, or various degrees of association. Association is any possible dependence between two data objects, such as delay or time lag, distance, price difference, or other relational variables. Due to its close relationship to graph theory, clustering is also used in network design, such as C-RAN design and network capacity and resilience planning. In the former case, the goal is to maximize the network performance subject to strict delay and resilience constraints. In the latter case, subnetworks with high capacity or resilience properties can be analyzed. A cluster can be said to have a center of gravity, identified as the point closest to all points based on some distance measure. The clusters' centers of gravity can be used to assign concentrator or switching facilities in a manner that minimizes the cost of facilities and transportation of traffic.
6.2 Complexity

Clustering is NP-hard. The number of possible clusterings of m data points into k clusters is bounded above by k^m, which is exponential in m. For example, if m is the number of links in a network, so that m = n(n − 1)/2, and the number of clusters is two, representing present links and absent links, the upper limit is 2^{n(n−1)/2}. Due to the NP-hardness of the problems involved in network design, decomposition (also known as divide-and-conquer) methods are effective in reducing problem complexity. The idea behind decomposition is to scale down the problem instances to a level at which they can be solved with reasonable effort. Since the complexity of many graph problems increases exponentially with the order of the graph, decomposition methods yield a large reduction in effort if performed properly. Many of the algorithms presented in the text therefore use decomposition, such as approximations, local search, and randomization algorithms.
6.3 Cluster Properties and Quality Measures

Clustering is the process of grouping data objects together based on some measure of similarity between these objects. In network design, this similarity is usually related to cost and can therefore be translated into geographical distance between terminals and concentrators and the amount of traffic they may carry. Roughly speaking, the more traffic a cluster may carry per area unit, the denser it is and the higher the quality it possesses. At the same time, the distance
to other clusters should be large, so that the number of concentration devices is minimized. It is therefore desirable to measure the cluster quality in an efficient way. Even if the quality of clusters can be considered a rather subjective matter that is very dependent on the application, there are some general measures that may be used for the evaluation of a decomposition. In general terms, a clustering should be dense within a cluster and sparse between clusters.

It is illuminating to use the analogy with graphs to illustrate cluster quality. The data points are represented by the vertices V in a graph G = (V, E), and the similarity between the data points is represented by the lengths of the edges E connecting the vertices. Note that the edge lengths need not be restricted to a two- or three-dimensional space, and so it may not be possible to visually depict the graph in a low-dimensional space.

We use the following terminology to describe clusters. Let G = (V, E) be a connected, undirected graph with |V| = n, |E| = m and let C = (C1, . . . , Ck) be a partition of V. We call C a clustering of G and the Ci clusters; C is called trivial if either k = 1 or all clusters Ci contain only one element. We often identify a cluster Ci with the induced subgraph of G, that is, the graph G[Ci] = (Ci, E(Ci)), where E(Ci) = {{v, w} ∈ E : v, w ∈ Ci}. Then E(C) = ∪_{i=1}^{k} E(Ci) is the set of intracluster edges, whose number is denoted m(C), and E \ E(C) is the set of intercluster edges, whose number is denoted m̄(C). A clustering C = (C, V \ C) is a cut of G, and m̄(C) is the size of the cut.

The clustering problem can formally be stated as follows. Given an undirected graph G = (V, E), a density measure δ(·) defined over vertex subsets S ⊆ V, a positive integer k ≤ n, and a rational number η ∈ [0, 1], is there a subset S ⊆ V such that |S| = k and the density δ(S) ≥ η? Note that simple maximization of any density measure without fixing k would result in choosing any clique. This fact shows that the problem is NP-complete, since for η = 1 it coincides with the maximum clique problem.
Vertex Similarity

Central to clustering is the distance (in some sense) between points, or rather its reciprocal – their similarity. A distance measure dist(di, dj) between two points di and dj is usually required to fulfill the following criteria:

(1) dist(di, di) = 0,
(2) dist(di, dj) = dist(dj, di) (symmetry),
(3) dist(di, dj) ≤ dist(di, dk) + dist(dk, dj) (triangle inequality).

For points in n-dimensional Euclidean space, possible distance measures between two data points di = (di,1, di,2, . . . , di,n) and dj = (dj,1, dj,2, . . . , dj,n) include the Euclidean distance
(L2 norm),

dist(di, dj) = sqrt( Σ_{k=1}^{n} (di,k − dj,k)² ),

the Manhattan distance (L1 norm),

dist(di, dj) = Σ_{k=1}^{n} |di,k − dj,k|,

and the L∞ norm,

dist(di, dj) = max_{k∈[1,n]} |di,k − dj,k|.
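These three distance measures are straightforward to implement. The short Python functions below are a sketch; they assume the two points are given as equal-length coordinate sequences.

```python
from math import sqrt

def euclidean(d_i, d_j):
    """L2 norm of the coordinate differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(d_i, d_j)))

def manhattan(d_i, d_j):
    """L1 norm of the coordinate differences."""
    return sum(abs(a - b) for a, b in zip(d_i, d_j))

def chebyshev(d_i, d_j):
    """L-infinity norm of the coordinate differences."""
    return max(abs(a - b) for a, b in zip(d_i, d_j))
```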
Possibly the most straightforward manner of determining whether two vertices are similar using only the adjacency information is to study the overlap of their neighborhoods in G = (V, E): a straightforward way is to compute the ratio of the intersection and the union of the two neighborhoods,

ω(u, v) = |Γ(u) ∩ Γ(v)| / |Γ(u) ∪ Γ(v)|,

arriving at the Jaccard similarity. The measure takes values in [0, 1]; it is zero when there are no common neighbors and one when the neighborhoods are identical.

Another measure is the Pearson correlation of columns (or rows) in a modified adjacency matrix C = AG + I (the modification simply forces all reflective edges to be present). The Pearson correlation is defined for two vertices vi and vj corresponding to the columns i and j of C as

( n Σ_{k=1}^{n} ci,k cj,k − deg(vi) deg(vj) ) / sqrt( deg(vi) deg(vj) (n − deg(vi)) (n − deg(vj)) ).

This value can be used as an edge weight ω(vi, vj) to construct a symmetric similarity matrix.

In a graph, closeness can be seen as the degree of connectivity, that is, the number of edge-disjoint paths that exist between each pair of vertices. With this metric, vertices belong to the same cluster if they are highly connected to each other. It is, however, not necessary that two vertices u and v belonging to the same cluster are connected by a direct edge if they are connected by a short path. Therefore, a similarity matrix can be based on the distance between each vertex pair, where a short distance implies a high degree of similarity. We can use a threshold k on the path length, so that similar vertices must be at distance at most k from each other. Such a subgraph is called a k-clique.
If we require that the induced subgraph be a k-clique, this implies that the paths of length at most k connecting the cluster members must be restricted to intracluster edges only. The threshold k should be compared with the diameter, the maximum distance between any two nodes in the graph. A threshold close to the diameter may lead to too large clusters, whereas too small values of the threshold k may force splitting of natural clusters.
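Returning to the neighborhood-overlap measures, the Jaccard and Pearson similarities above can be sketched as follows; the adjacency-set representation and function names are illustrative, and the Pearson variant works with the column sums of the modified matrix C = A + I.

```python
from math import sqrt

def jaccard_similarity(adj, u, v):
    """Jaccard similarity of the neighborhoods of u and v."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

def pearson_similarity(adj, u, v, n):
    """Pearson correlation of columns u and v of C = A + I (0/1 entries)."""
    cu = adj[u] | {u}            # column of C: neighbors plus the vertex itself
    cv = adj[v] | {v}
    du, dv = len(cu), len(cv)    # column sums
    dot = len(cu & cv)           # sum_k c_{u,k} * c_{v,k}
    denom = sqrt(du * dv * (n - du) * (n - dv))
    return (n * dot - du * dv) / denom if denom else 0.0
```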
Expansion

Starting the reasoning from the opposite end, a well-performing clustering algorithm assigns similar points to the same cluster and dissimilar points to different clusters. Expressing the clustering as a graph, points within the same cluster induce low-cost edges, and points that are farther apart induce high-cost edges. We therefore interpret the clustering problem as a node partitioning problem on an edge-weighted complete graph. In the graph, the edge weight auv then represents the similarity of vertices u and v. Associated with the graph is an n × n symmetric matrix A with entries auv. We shall assume that the auv are nonnegative.

The quality of a clustering can be described by the size (weight) of a cut relative to the sizes of the clusters it creates. The expansion measures the relative cut size of a partitioned graph. The expansion of a graph is the minimum ratio of the total weight of edges of a cut to the number of vertices in the smaller part separated by the cut. The expansion of a cut (S, S̄) is defined as

ϕ(S) = Σ_{i∈S, j∉S} aij / min(|S|, |S̄|).

We say that the minimum expansion of a graph is the minimum expansion over all the cuts of the graph. A measure of the quality of a cluster is the expansion of the subgraph corresponding to this cluster. The expansion of a clustering is the minimum expansion of one of the clusters. Expansion gives equal importance to all vertices of the given graph, which may lead to a rather strong requirement, particularly for outliers.
Coverage

The coverage of a graph clustering C is defined as

coverage(C) = m(C) / m,

where m(C) is the number of intracluster edges. Intuitively, the larger the value of the coverage, the better the quality of a clustering C. Notice that a minimum cut has maximum coverage. However, in general a minimum cut is not considered to be a good clustering of a graph.
Performance

The performance of a clustering C is based on a count of the number of "correctly assigned pairs of nodes" in a graph. It computes the fraction, out of all pairs of nodes, of intracluster edges together with nonadjacent pairs of nodes in different clusters, i.e.,

performance(C) = ( m(C) + |{{v, w} ∉ E | v ∈ Ci, w ∈ Cj, i ≠ j}| ) / ( ½ n(n − 1) ).

Alternatively, the performance can be computed as

1 − performance(C) = ( 2m(1 − coverage(C)) + Σ_{i=1}^{k} |Ci|(|Ci| − 1) − 2m(C) ) / ( n(n − 1) ).
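Coverage and performance can be evaluated with a few lines of code; the sketch below assumes a simple undirected graph given as an edge list and counts correctly classified pairs exactly as in the definition.

```python
def coverage_and_performance(n, edges, clusters):
    """Return (coverage, performance) of a clustering."""
    label = {v: i for i, cl in enumerate(clusters) for v in cl}
    m = len(edges)
    m_intra = sum(1 for u, v in edges if label[u] == label[v])
    coverage = m_intra / m if m else 0.0
    total_pairs = n * (n - 1) // 2
    intra_pairs = sum(len(cl) * (len(cl) - 1) // 2 for cl in clusters)
    # correctly classified pairs: intracluster edges + intercluster non-edges
    inter_non_edges = (total_pairs - intra_pairs) - (m - m_intra)
    performance = (m_intra + inter_non_edges) / total_pairs if total_pairs else 0.0
    return coverage, performance
```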
Conductance

The conductance of a cut compares the size of the cut and the number of edges in either of the two cut-separated subgraphs. The conductance φ(G) of a graph G is the minimum conductance over all cuts of G. The conductance actually allows defining two measures – the quality of an individual cluster (and therefore of the overall clustering) and the weight of the intercluster edges, providing a measure of the cost of the clustering. The quality of a clustering is given by two parameters: α, the minimum conductance of the clusters, and ε, the ratio of the weight of intercluster edges to the total weight of all edges. The objective is to find an (α, ε)-clustering that maximizes α and minimizes ε. The conductance of a cut (S, S̄) in G is denoted by

φ(S) = Σ_{i∈S, j∉S} aij / min(a(S), a(S̄)),

where a(S) = a(S, V) = Σ_{i∈S} Σ_{j∈V} aij. The conductance of a graph is the minimum conductance over all the cuts in the graph, i.e.,

φ(G) = min_{S⊆V} φ(S).
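In code, the conductance of a cut can be read off the similarity matrix directly. The numpy sketch below mirrors the weighted definition; the dense-matrix representation is an assumption made for brevity.

```python
import numpy as np

def cut_conductance(A, S):
    """Conductance of the cut (S, V \\ S) for a symmetric weight matrix A."""
    n = A.shape[0]
    S = sorted(set(S))
    S_bar = [v for v in range(n) if v not in set(S)]
    cut_weight = A[np.ix_(S, S_bar)].sum()   # sum of a_ij with i in S, j not in S
    a_S = A[S, :].sum()                      # a(S) = sum_{i in S, j in V} a_ij
    a_S_bar = A[S_bar, :].sum()
    denom = min(a_S, a_S_bar)
    return cut_weight / denom if denom > 0 else 0.0
```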
In order to quantify the quality of a clustering we generalize the definition of conductance further. Take a cluster C ⊆ V and a cut (S, C \ S) within C, where S ⊆ C. Then we say that the conductance of S in C is

φ(S, C) = Σ_{i∈S, j∈C\S} aij / min(a(S), a(C \ S)).
The conductance of a cluster φ(C) will then be the smallest conductance of a cut within the cluster. The conductance of a clustering is the minimum conductance of its clusters. We then obtain the following optimization problem: given a graph and an integer k, find a k-clustering with the maximum conductance.

There is still a problem with the above clustering measure. The graph might consist mostly of clusters of high quality and a few points that create clusters of poor quality, which leads to the conclusion that any clustering has a poor overall quality. One way to handle this is to avoid restricting the number of clusters, but this could lead to many singletons or very small clusters. Rather than simply relaxing the number of clusters, we introduce a measure of the clustering quality using two criteria – the minimum quality of the clusters, α, and the fraction of the total weight of edges that are not internal to the clusters, ε.

Definition 6.3.1 ((α, ε)-partition). We call a partition {C1, C2, . . . , Cl} of V an (α, ε)-partition if

(1) the conductance of each Ci is at least α;
(2) the total weight of intercluster edges is at most an ε fraction of the total edge weight.
•
Associated with this bicriterion is the following optimization problem (relaxing the number of clusters). Given a value of α, find an (α, ε)-partition that minimizes ε. Alternatively, given a value of ε, find an (α, ε)-partition that maximizes α. There is a monotonic function f that represents the optimal (α, ε) pairings: for each α there is a minimum value of ε, equal to f(α), such that an (α, ε)-partition exists.

In addition to direct density measures, conductance also measures connectivity with the rest of the graph to identify high-quality clusters. Measures of the "independence" of a subgraph from the rest of the graph have been defined based on cut sizes. For any proper nonempty subset S ⊂ V in a graph G = (V, E), the conductance is defined as

φ(S) = c(S, V \ S) / min{deg(S), deg(V \ S)}.
The internal and external degrees of a cluster C are defined as

degint(C) = |{{u, v} ∈ E | u, v ∈ C}|,
degext(C) = |{{u, v} ∈ E | u ∈ C, v ∈ V \ C}|.

Note that the external degree is in fact the size of the cut (C, V \ C). The relative density is

ρ(C) = degint(C) / (degint(C) + degext(C)) = Σ_{v∈C} degint(v, C) / Σ_{v∈C} (degint(v, C) + 2 degext(v, C)).
For cluster candidates with only one vertex (and any other candidate that is an independent set), we set ρ(C) = 0. The computational challenge lies in identifying subgraphs within the input graph that reach a certain value of a measure, whether of density or independence, as the number of possible subgraphs is exponential. Consequently, finding the subgraph that optimizes the measure (that is, a subgraph of a given order k that reaches the maximum value of a measure in the graph) is computationally hard. However, as the computation of the measure for a known subgraph is polynomial, we may use these measures to evaluate whether or not a given subgraph is a good cluster.

For a clustering C = (C1, . . . , Ck) of a graph G, the intracluster conductance α(C) is the minimum conductance value over all induced subgraphs G[Ci], while the intercluster conductance δ(C) is the maximum conductance value over all induced cuts (Ci, V \ Ci). For a formal definition of the different notions of conductance, let us first consider a cut C = (C, V \ C) of G and define the conductance φ(C) and φ(G) as follows:

φ(C) = 1, if C ∈ {∅, V},
φ(C) = 0, if C ∉ {∅, V} and m̄(C) = 0,
φ(C) = m̄(C) / min( Σ_{v∈C} deg(v), Σ_{v∈V\C} deg(v) ), otherwise,

φ(G) = min_{C⊆V} φ(C).
Then a cut has low conductance if its size is small relative to the density of either side of the cut. Such a cut can be considered as a bottleneck. Minimizing the conductance over all cuts of a graph and finding the corresponding cut is NP-hard, but it can be approximated with a polylogarithmic approximation guarantee in general, and a constant guarantee for special cases. Based on the notion of conductance, we can now define the intracluster conductance α(C) and the intercluster conductance δ(C). We have

α(C) = min_{i∈{1,...,k}} φ(G[Ci]),
δ(C) = 1 − max_{i∈{1,...,k}} φ(Ci).
In a clustering with small intracluster conductance there is supposed to be at least one cluster containing a bottleneck, that is, the clustering is possibly too coarse in this case. On the other hand, a clustering with small intercluster conductance is supposed to contain at least one cluster that has relatively strong connections outside, that is, the clustering is possibly too fine. To see that a clustering with maximum intracluster conductance can be found in polynomial time, consider first m = 0. Then α(C) = 0 for every nontrivial clustering C, since it contains at least one cluster Cj with φ(G[Cj]) = 0. If m ≠ 0, consider an edge {u, v} ∈ E and the clustering C with C1 = {u, v} and |Ci| = 1 for i ≥ 2. Then α(C) = 1, which is a maximum.

Intracluster conductance may exhibit some artificial behavior for clusterings with many small clusters. This justifies the restriction to clusterings satisfying certain additional constraints on the size or number of clusters. However, under these constraints, maximizing intracluster conductance becomes an NP-hard problem. Finding a clustering with maximum intercluster conductance is NP-hard as well, because it is at least as hard as finding a cut with minimum conductance. Although finding an exact solution is NP-hard, the algorithm presented in [54] is shown to have simultaneous polylogarithmic approximation guarantees for the two parameters in the bicriterion measure.
6.4 Heuristic Clustering Methods

Heuristic methods do not deliver a result with any guarantee, but they are often built on straightforward principles and are therefore easy to modify. We discuss the k-nearest neighbor and the k-means algorithms, which are used not only in clustering, but also as subroutines in many other algorithms.
k-Nearest Neighbor

A conceptually simple and powerful method that can be used as a clustering technique is the k-nearest neighbor. In graph theory and other combinatorial problems, the nearest neighbor is a much used principle in search applications and greedy algorithms. The k-nearest neighbor method simply takes a parameter k, the size of the neighborhood, and a data point p and determines the k data points that in some sense are closest to p. A straightforward way of implementing this is to maintain a vector of k candidate points pi ≠ p. We can then successively replace a data point pi in the vector by pj whenever the distance d(pj, p) < d(pi, p).
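A minimal sketch of the k-nearest neighbor query, with an arbitrary distance function passed in (sorting is used instead of the replacement scheme described above, which is adequate for moderate data sizes):

```python
def k_nearest_neighbors(points, p, k, dist):
    """Return the k points in `points` closest to p, excluding p itself."""
    candidates = [q for q in points if q is not p]
    candidates.sort(key=lambda q: dist(q, p))   # closest first
    return candidates[:k]
```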
k-Means and k-Median

A popular algorithm for clustering data with respect to a distance function is the k-means algorithm. The basic idea is to assign a set of points in some metric space to k clusters by iteration, successively improving the location of the k cluster centers and assigning each point to the cluster having the closest center. The centers are often chosen to minimize the sum of squares of the distances within each cluster; this is the metric used in the k-means algorithm. If the median is used instead, we have the k-median algorithm. Collectively, the cluster centers are known as centroids.
The method starts with k > 1 initial cluster centers, and the data points are assigned greedily to the cluster centers. Next, the algorithm alternates between recomputing the center positions, that is, the centroids, from the data points, and assigning the data points to the new locations. These steps are repeated until the algorithm converges.

The choice of k may or may not be given by the problem. Usually, we can form some idea of the magnitude of k, for example from expected cluster sizes. Otherwise, we can either perform trial-and-error to find a suitable k, or resort to methods that determine k as well. Such methods are discussed in Chapter 8. The next step is to estimate initial positions of the cluster centers. Again, this can be more or less obvious from the problem itself, and the locations can be estimated by inspection. Alternatively, the k center points can be chosen randomly from the data points. Another heuristic approach is to select the two mutually most distant data points as the first two center points, and subsequent center points as the data points with the largest distance to the already chosen center points. This heuristic guarantees a good spread of the center points, but it is usually not particularly accurate.

The k-means algorithm uses the Euclidean distance to compute the centroids,

mi = (1/|Ci|) Σ_{xj∈Ci} xj,

for i = 1, . . . , k and clusters Ci.

Algorithm 6.4.1 (Forgy).
Given a data set and k initial centroid estimates m1^(0), m2^(0), . . . , mk^(0). Let t denote the current iteration.
STEP 1: Assignment: Construct clusters Ci^(t) as Ci^(t) = {xp : ||xp − mi^(t)||² ≤ ||xp − mj^(t)||² for all 1 ≤ j ≤ k}. Each data point is assigned to exactly one cluster Ci, and ties are broken arbitrarily.
STEP 2: Update: Calculate the new centroid locations as

mi^(t+1) = (1/|Ci^(t)|) Σ_{xj∈Ci^(t)} xj.

The algorithm has converged when the assignments no longer change. Output C1, C2, . . . , Ck.
•
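A compact implementation of Forgy's iteration might look as follows. This is a sketch using numpy; the farthest-point initialization described above is included as a helper, and all names are illustrative rather than taken from the text.

```python
import numpy as np

def farthest_point_init(X, k):
    """Greedy spread-out initialization: each new center is farthest from those chosen."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    return np.array(centers, dtype=float)

def k_means(X, k, max_iter=100):
    """Forgy's k-means: alternate assignment and centroid update until stable."""
    m = farthest_point_init(X, k)
    assign = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2)
        new_assign = np.argmin(dists, axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break                         # converged: assignments unchanged
        assign = new_assign
        # Update step: recompute each centroid as the mean of its cluster
        for i in range(k):
            if np.any(assign == i):
                m[i] = X[assign == i].mean(axis=0)
    return assign, m
```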
6.5 Spectral Clustering

Spectral clustering is a general technique of partitioning the rows of an n × n matrix A according to their components in the first k eigenvectors (or more generally, singular vectors) of the matrix. The matrix contains the pairwise similarities of data points or nodes of a graph. Let the rows of A contain points in a high-dimensional space; in essence, the data may have n dimensions. From linear algebra we know that the subspace defined by the first k eigenvectors of A, the eigenvectors corresponding to the k smallest eigenvalues, defines the subspace of rank k that best approximates A. The spectral algorithm projects all the points onto this subspace. Each eigenvector then defines a cluster. To obtain a clustering, each point is projected onto the eigenvector that is closest to it in angle.

In a similarity matrix, the diagonal elements are zero, since we do not allow self-loops. We then form the Laplacian L, a matrix with the sums of the row elements of A (also known as the volumes) on the diagonal. The eigenvectors of the Laplacian L are then used for clustering. Given a matrix A, the spectral algorithm for clustering the rows of L can be summarized as follows.

Algorithm 6.5.1 (Spectral clustering).
Given an n × n similarity matrix A and its Laplacian L.
STEP 1: Find the top k right singular vectors v1, v2, . . . , vk of L.
STEP 2: Let C be the matrix whose jth column is given by Avj.
STEP 3: Place row i in cluster j if Cij is the largest entry in the ith row of C.
Output clustering C1 , C2 , . . . , Ck .
•
We discuss the steps of the algorithm in some more detail.
Similarity Matrices

There are several ways to construct a similarity matrix for a given set of data points x1, . . . , xn with pairwise distances dij or similarities sij = 1/dij (that is, a short distance means strong similarity). This matrix serves as a model of the local neighborhood relationships between the data points, and this neighborhood can be defined using different principles, which may be suggested by the problem at hand. When the relationship between nodes cannot easily be expressed by a single distance measure, the similarity matrix can be defined in alternative ways.
The ε-neighborhood
Let points belong to the same neighborhood if their pairwise distances are smaller than a threshold ε. Since the distances between all points in the neighborhood are roughly ε, the entries aij within a neighborhood are often simply set to the same value, for example aij = 1, and entries representing nodes not in the same neighborhood are set to zero.

k-Nearest neighbors
We let node vi be in the same neighborhood as vj if vj is among the k-nearest neighbors of vi . This relation, however, is not symmetric, and we therefore need to handle cases where vj is in the neighborhood of vi , but vi is not in the neighborhood of vj . The first way to do this is to simply ignore this asymmetry, so that vi and vj are in the same neighborhood whenever vi is a k-nearest neighbor of vj or vj is a k-nearest neighbor of vi . Alternatively, we may choose to let vi and vj be in the same neighborhood whenever both vi is a k-nearest neighbor of vj and vj is a k-nearest neighbor of vi . The entries aij in the similarity matrix between nodes belonging to the same neighborhood can then be set to their pairwise similarity sij and to zero otherwise.
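Both neighborhood constructions can be sketched as follows, assuming a precomputed distance matrix D and similarity matrix S (the mutual flag switches between the two ways of symmetrizing the k-nearest neighbor relation):

```python
import numpy as np

def epsilon_graph(D, eps):
    """Similarity matrix of the epsilon-neighborhood graph (unweighted entries)."""
    A = (D < eps).astype(float)
    np.fill_diagonal(A, 0.0)             # no self-loops
    return A

def knn_graph(D, S, k, mutual=False):
    """k-nearest-neighbor similarity graph built from distances D and similarities S."""
    n = D.shape[0]
    A = np.zeros((n, n))
    order = np.argsort(D, axis=1)
    for i in range(n):
        neighbors = [j for j in order[i] if j != i][:k]
        A[i, neighbors] = 1.0
    A = np.minimum(A, A.T) if mutual else np.maximum(A, A.T)   # symmetrize
    W = A * S                            # keep similarities only where connected
    np.fill_diagonal(W, 0.0)
    return W
```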
Laplacians

Spectral clustering is based on Laplacian matrices, which come in several variants. Although we discuss a particular form of matrices, it is beneficial to express the problem in terms of a graph. Let therefore G be an undirected weighted graph, with weight matrix W with entries wij = wji ≥ 0. Also, let D be the diagonal matrix containing the weighted degrees of the nodes (that is, the sums di = Σ_{j≠i} wij). The (unnormalized) Laplacian is then defined as L = D − W, and it has the following important properties.

Proposition 6.5.2. For the matrix L the following is true:

(1) For every vector f ∈ R^n,

f^T L f = (1/2) Σ_{i,j=1}^{n} wij (fi − fj)².

(2) L is symmetric and positive semidefinite.
(3) The smallest eigenvalue of L is 0, corresponding to the eigenvector 1.
(4) L has n nonnegative, real eigenvalues 0 = λ1 ≤ λ2 ≤ . . . ≤ λn.
Proof. (1) From the definition of di, we have

f^T L f = f^T D f − f^T W f = Σ_{i=1}^{n} di fi² − Σ_{i,j=1}^{n} fi fj wij
= (1/2) ( Σ_{i=1}^{n} di fi² − 2 Σ_{i,j=1}^{n} fi fj wij + Σ_{j=1}^{n} dj fj² ) = (1/2) Σ_{i,j=1}^{n} wij (fi − fj)².
(2) The symmetry of L follows from the symmetry of W and D. The positive semidefiniteness is a direct consequence of (1), which shows that f^T L f ≥ 0 for all f ∈ R^n.
(3) This is obvious.
(4) Follows directly from (1)–(3).

We have the following result, tying together the connectivity of a graph and the spectrum of the associated Laplacian. For a proof see, for example, [55].

Proposition 6.5.3. Let G be an undirected graph with nonnegative weights and L its (unnormalized) Laplacian. Then the multiplicity k of the eigenvalue 0 of L equals the number of connected components G1, . . . , Gk in the graph. The eigenspace of eigenvalue 0 is spanned by the indicator vectors 1A1, . . . , 1Ak of those components.

As the use of the qualifier "unnormalized" suggests, the Laplacian can also be normalized. The symmetric normalized and the random walk Laplacians are as follows:

Lsym = D^{−1/2} L D^{−1/2} = I − D^{−1/2} W D^{−1/2},
Lrw = D^{−1} L = I − D^{−1} W.

Similar properties are valid for these Laplacians as for the unnormalized version, but the details are omitted here. The eigenvalues λi and eigenvectors vi of an n × n matrix A satisfy the equation

(A − λi I) vi = 0,   vi ≠ 0.
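Constructing the three Laplacians from a weight matrix is mechanical; the following numpy sketch mirrors the definitions and assumes strictly positive weighted degrees.

```python
import numpy as np

def laplacians(W):
    """Return (L, L_sym, L_rw) for a symmetric nonnegative weight matrix W."""
    d = W.sum(axis=1)                    # weighted degrees d_i
    L = np.diag(d) - W                   # unnormalized Laplacian
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = D_inv_sqrt @ L @ D_inv_sqrt  # I - D^{-1/2} W D^{-1/2}
    L_rw = np.diag(1.0 / d) @ L          # I - D^{-1} W
    return L, L_sym, L_rw
```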
Eigenvectors

Most computational packages include routines for eigenvalues and eigenvectors. Alternatively, an easy-to-use method is the power method. Let z0 be an arbitrary initial vector (possibly random) and calculate

z_{s+1} = A z_s / ||A z_s||,   s = 0, 1, 2, . . . .

Assuming that A has n linearly independent eigenvectors and a unique (dominant) eigenvalue of maximum magnitude, and that z0 has a nonzero component in the direction of an eigenvector of the dominant eigenvalue, z_s converges to an eigenvector corresponding to this eigenvalue. The dominant eigenvalue is the limiting value of the sequence

μ_s = z_s^T A z_s / (z_s^T z_s).

When the first eigenpair has been found, the eigenpair corresponding to the second-most dominant eigenvalue can be found by deflation, starting from A_0 = A and forming

A_{i+1} = A_i − λ_i v_i v_i^T.

We here use the convention of ordering the eigenvalues in increasing order. The first k eigenvectors are therefore the eigenvectors corresponding to the k smallest eigenvalues.
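A bare-bones power iteration with deflation, as described above, might be sketched as follows (symmetric matrix assumed; the tolerance and iteration limit are arbitrary illustrative choices):

```python
import numpy as np

def power_method(A, iters=1000, tol=1e-10):
    """Dominant eigenpair of a symmetric matrix A by power iteration."""
    z = np.random.default_rng(0).standard_normal(A.shape[0])
    lam = 0.0
    for _ in range(iters):
        w = A @ z
        z = w / np.linalg.norm(w)
        lam_new = z @ A @ z              # Rayleigh quotient estimate
        if abs(lam_new - lam) < tol:
            lam = lam_new
            break
        lam = lam_new
    return lam, z

def dominant_eigenpairs(A, k):
    """First k eigenpairs, largest in magnitude first, via repeated deflation."""
    pairs, B = [], A.astype(float).copy()
    for _ in range(k):
        lam, v = power_method(B)
        pairs.append((lam, v))
        B = B - lam * np.outer(v, v)     # deflation: A_{i+1} = A_i - lam_i v_i v_i^T
    return pairs
```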
Projection

For the mapping of the data points we can use the k-means algorithm. Selecting the eigenvectors corresponding to the k smallest nonzero eigenvalues, the initial centroids can be taken as the first k elements of the k eigenvectors. The eigenvectors contain the approximate projection (restricted to k dimensions) of the data onto the k centroids. We compute the distance to the centroids for all rows in the eigenvectors and update the centroid coordinates. Iterating until the assignment does not change, we obtain the clustering.

Example 6.5.1. Spectral clustering is applied to a data set consisting of 400 geographical objects. The eigenvalues of the Laplacian are computed with the QR method and the corresponding eigenvectors using the power method with deflation (see [59]), and clusters are formed using Forgy's k-means algorithm. The clustering in Fig. 6.1 shows well-formed clusters. The colors are reused for different clusters.

Figure 6.1: Spectral clustering of geographical data.
6.6 Iterative Improvement

Once an approximate minimum cut, generating a bisection, that is, two clusters of approximately equal size, has been found, this approximation can be improved by an iterative improvement algorithm. Let G = (V, E) be a graph. The algorithm attempts to find a partition of V into two disjoint subsets A and B of equal size, such that the sum T of the weights of the edges between nodes in A and B is minimized. Let Ia be the internal cost of a, that is, the sum of the costs of edges
between a and other nodes in A, and let Ea be the external cost of a, that is, the sum of the costs of edges between a and nodes in B. Furthermore, let Da = Ea − Ia be the difference between the external and internal costs of a. If a and b are interchanged, then the reduction in cost is

Told − Tnew = Da + Db − 2cab,

where cab is the cost of the possible edge between a and b. The algorithm attempts to find an optimal series of interchange operations between elements of A and B which maximizes Told − Tnew and then executes the operations, producing a partition of the graph into A and B. Simulated annealing and genetic algorithms have also been used to partition graphs, but some results show that they provide clusters of inferior quality and require much greater computational resources than the spectral partitioning algorithm [56].
Uniform Graph Partitioning

Iterative improvement is particularly efficient when the data consist of two clusters of nearly equal size.
Definition 6.6.1. Given a symmetric cost matrix cij defined on the edges of a complete undirected graph G = (V, E) with |V| = 2n vertices, a partition V = A ∪ B such that |A| = |B| is called a uniform partition. The uniform graph partitioning problem is that of finding a uniform partition V = A ∪ B such that the cost

C(A, B) = Σ_{i∈A, j∈B} cij

is minimal over all uniform partitions.

The problem can be thought of as dividing a load into two pieces of equal size so that the weight of the connection between the two pieces is as small as possible. This represents common situations in engineering, such as VLSI design, parallel computing, or load balancing.

Suppose (A*, B*) is an optimal uniform partition and we are considering some partition (A, B). Let X be those elements of A that are not in A* – the "misplaced" elements – and let Y be similarly defined for B. Then |X| = |Y| and

A* = (A − X) ∪ Y,
B* = (B − Y) ∪ X.

That is, we can obtain the optimal uniform partition by interchanging the elements in set X with those in Y.

Definition 6.6.2. Given a uniform partition A, B and elements a ∈ A and b ∈ B, the operation of forming

A' = (A − {a}) ∪ {b},
B' = (B − {b}) ∪ {a}

is called a swap.

We next consider how to determine the effect that a swap has on the cost of a partition (A, B). We define the external cost E(a) associated with an element a ∈ A by

E(a) = Σ_{i∈B} dai
and the internal cost I(a) by

I(a) = Σ_{j∈A} daj

(and similarly for elements of B). Let D(v) = E(v) − I(v) be the difference between external and internal cost for all v ∈ V.

Lemma 6.6.1. The swap of a and b results in a reduction of cost (gain) of

g(a, b) = D(a) + D(b) − 2dab.

The swap neighborhood Ns for the uniform graph partitioning problem is Ns(A, B), that is, the set of all uniform partitions A', B' that can be obtained from the uniform partition A, B by a single swap [46].
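The swap gain and a single greedy improvement step can be sketched as follows. This is a simplified variant in the spirit of the Kernighan-Lin procedure, not the exact algorithm of the text; the cost matrix d is a numpy array and A, B are sets of vertex indices.

```python
import numpy as np

def swap_gain(d, A, B, a, b):
    """Gain g(a, b) = D(a) + D(b) - 2 d_ab of swapping a in A with b in B."""
    D_a = d[a, list(B)].sum() - d[a, [x for x in A if x != a]].sum()
    D_b = d[b, list(A)].sum() - d[b, [y for y in B if y != b]].sum()
    return D_a + D_b - 2 * d[a, b]

def improve_once(d, A, B):
    """Perform the single best swap if it reduces the partition cost."""
    gain, a, b = max((swap_gain(d, A, B, x, y), x, y) for x in A for y in B)
    if gain > 0:
        A, B = (A - {a}) | {b}, (B - {b}) | {a}
    return A, B
```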