Neurocomputing 74 (2011) 1391–1401
Spectral clustering with more than K eigenvectors

Nicola Rebagliati, Alessandro Verri

DISI, Universita di Genova, v. Dodecaneso 35, 16146 Genova, Italy
Article info: Available online 19 February 2011.

Abstract
Spectral clustering techniques are heuristic algorithms aiming to find approximate solutions to difficult graph-cutting problems, usually NP-complete, which are useful for clustering. A fundamental working hypothesis of these techniques is that the optimal partition of K classes can be obtained from the first K eigenvectors of the graph normalized Laplacian matrix L_N if the gap between the K-th and the (K+1)-th eigenvalue of L_N is sufficiently large. If the gap is small, a perturbation may swap the corresponding eigenvectors and the results can be very different from the optimal ones. In this paper we suggest a weaker working hypothesis: the optimal partition of K classes can be obtained from a K-dimensional subspace of the first M > K eigenvectors, where M is a parameter chosen by the user. We show that the validity of this hypothesis can be confirmed by the gap size between the K-th and the (M+1)-th eigenvalue of L_N. Finally, we present and analyse a simple probabilistic algorithm that generalizes current spectral techniques in this extended framework. This algorithm gives results on real world graphs that are close to the state of the art by selecting correct K-dimensional subspaces of the linear span of the first M eigenvectors that are robust to small changes of the eigenvalues.
Keywords: Spectral clustering; Spectral gap
1. Introduction

Among the Computer Vision and Machine Learning communities, spectral clustering techniques have spread as an effective tool for image segmentation [1], clustering [2] and semi-supervised learning [3]. These techniques represent the input data as vertices of a weighted graph, where weights usually encode the similarities between the underlying data. Given this representation, the problem is that of partitioning the graph into K components while minimizing a clustering measure of this partition. We refer to partitions achieving the minimum value as optimal partitions. In this paper we focus on the clustering functional given by the normalized cut. This problem aims at minimizing the sum of cut weights between the K clusters and, at the same time, at maximizing the sum of weights of each cluster, called volume. Minimizing the normalized cut is an NP-complete problem [1]; however, in some cases we can easily find the desired partition by inspecting the first eigenvectors of the normalized Laplacian matrix L_N of the graph. The working hypothesis of many spectral algorithms is that the similarity matrix between data is close to being a K-block matrix, up to a convenient reordering of the input dataset. When this happens the first K eigenvectors of L_N are, up to a rotation, close to a weighted version of the indicator vectors of the optimal partition.
This assumption is validated by looking at the size of the gap, that is, the absolute value of the difference between the (K+1)-th and the K-th eigenvalue of L_N. If this value is not large we do not have guarantees on the information content of these first K eigenvectors. We first show through a toy example that a small eigengap may cause an undesired swapping of the eigenvectors, thus making it necessary to choose more than K eigenvectors. We give a theoretical justification for choosing M ≥ K eigenvectors by looking at the eigengap between the (M+1)-th and the K-th eigenvalue. If this quantity is large, the first M eigenvectors contain a large part of the information necessary to recover the solution. Then we propose a simple probabilistic algorithm based on randomly projecting the M eigenvectors onto a K-dimensional subspace. We study the probabilistic guarantees of this algorithm and show its stability against small changes in the eigenvalues. The reported experiments on partitioning real world graphs show that the algorithm gives very good results, most of the time close to the state of the art, on bipartitioning problems.
The literature of machine learning proposes many spectral clustering algorithms, see [4,5,2,6] for examples. All of these consider the first K eigenvectors of the normalized Laplacian matrix, perform some normalization and finally "round" to obtain a clustering. The normalized cut is introduced in [1] together with a recursive spectral clustering algorithm for image segmentation. The Laplacian operator is extensively used in semi-supervised learning and for spectral embeddings [7,8,3,9].
Theoretical and computational properties of spectral algorithms have been studied in the literature; see [10–12] and references therein. In particular, [11] shows examples in which, in the case of the combinatorial Laplacian and the ratio cut, the second eigenvector leads to a bipartition noticeably different from the optimal bipartition. Spectral partitioning was introduced for cutting graphs in [13,14]. We refer to Section 4 of [15] for a survey. Many recent works try to exploit the advantages of spectral techniques and to fix their problems [16,17]. A probabilistic algorithm was proposed for VLSI design in [18,19], in which one- or two-dimensional random projections called "probes" were used. Empirical evidence shows [12,20,21] that the use of more than K eigenvectors leads to better results. In [12] it is argued that the use of as many eigenvectors as possible improves the approximation of the clustering functional. In [20,21] a relevant subset of more than K eigenvectors is proposed. This paper is an extended version of [22].
The rest of the paper is organized as follows. In Section 2 we set up our notation and review the notion of the normalized cut. In Section 3 we discuss the spectral clustering heuristic along with some examples. In Section 4 we introduce the notion of extended gap for choosing the number of eigenvectors. In Section 5 we analyse a simple probabilistic algorithm. Finally, in Section 6, we present experimental results, and we draw the conclusions in Section 7.
2. The normalized cut functional

The purpose of spectral clustering is to find a partition of a graph that minimizes a clustering functional like the normalized cut or the ratio cut. Many successful heuristic algorithms for minimizing the normalized cut have been developed [1,5,4,23]. In this section we focus on the normalized cut, but similar considerations hold for the ratio cut as well.

2.1. Notation and background

Given a dataset and a function for computing the similarity between pairs of data elements, the goal is to partition the dataset into disjoint, non-empty subsets according to some measure of clustering quality. We denote the set of possible partitions of a dataset with cardinality n into K classes as P_{n,K}. In spectral clustering this partitioning problem is formalized using the language of graphs. The elements of the dataset are seen as a set of vertices V = {v_1, ..., v_n} of cardinality |V| = n and their pairwise relationships as a set E of edges of a weighted graph G = (V,E). A value is assigned to each edge through a weighting, or similarity, function w: E → [0,1]. Function w is user defined and reflects the user's concept of similarity between data. As is usual practice, we consider w to be symmetric, so that w(v_i, v_j) = w(v_j, v_i). A weighted undirected graph G can be represented with a weighted symmetric adjacency matrix W ∈ R^{|V|×|V|} such that W_{i,j} = w(v_i, v_j). We denote the degree of vertex v_i as d(v_i, W) and the volume of G as vol(G). We work with graphs where each vertex has degree strictly greater than zero. The degree matrix D is diagonal with the degrees of each vertex on its diagonal, D_{i,i} = d(v_i, W) and D_{i,j} = 0 if i ≠ j. The normalized Laplacian matrix of a weighted undirected graph [24] with weighted adjacency matrix W is
$$L_N(W) := I - D^{-1/2} W D^{-1/2}.$$
We denote the spectral decomposition of the normalized Laplacian matrix by L_N(W) = U Λ U^T. The eigenvalues of the normalized Laplacian are diag(Λ): 0 = λ_1 ≤ λ_2 ≤ ... ≤ λ_{|V|} ≤ 2. It is straightforward to check that u_1 = D^{1/2} 1 / √(vol(G)) is the first eigenvector of the normalized Laplacian, where 1 ∈ R^{|V|×1} is the vector of ones. For other properties of the normalized Laplacian's eigenvalues we refer to [24]. We denote the i-th eigenvector, corresponding to λ_i, as u_i. The matrix composed of the first 1 ≤ M ≤ |V| columns of U is U_M = [u_1, ..., u_M] ∈ R^{|V|×M}; the diagonal matrix with the first M eigenvalues on its diagonal is Λ_M. Sometimes it is convenient to work with the matrix D^{-1/2} U_M, whose columns D^{-1/2} u_m are called harmonic eigenvectors. In spectral clustering the matrix U_M, or U_M Λ_M^{1/2}, is considered as an embedding, or spectral embedding, of the graph's vertices. The number of columns M is the embedding dimension and each row i of U_M is a point that represents vertex v_i in this Euclidean space.
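As a concrete illustration of the notation above, the following sketch (ours, not part of the original paper; Python with NumPy, variable names are our own) builds the normalized Laplacian L_N(W) = I − D^{−1/2} W D^{−1/2} from a symmetric weight matrix and returns its first M eigenvalues and eigenvectors.

```python
import numpy as np

def normalized_laplacian(W):
    """L_N(W) = I - D^{-1/2} W D^{-1/2} for a symmetric weight matrix W with positive degrees."""
    d = W.sum(axis=1)                              # degrees d(v_i, W)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    return np.eye(W.shape[0]) - (d_inv_sqrt[:, None] * W) * d_inv_sqrt[None, :]

def spectral_embedding(W, M):
    """First M eigenvalues (ascending) and eigenvectors of L_N(W)."""
    lam, U = np.linalg.eigh(normalized_laplacian(W))
    return lam[:M], U[:, :M]
```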
2.2. Multi-class normalized cut

It is apparent from the graph formalization that we can consider clustering as a graph-cutting problem. Fixed a number K > 1, a cut of a graph G is a partition of the vertices into K disjoint sets A = (A_1, ..., A_K) ∈ P_{|V|,K} such that ∪_{k=1,...,K} A_k = V. We can assign to each cut a positive value, the sum of the cut weights
$$cut(A,W) := \sum_{k=1}^{K} \sum_{i \in A_k,\, j \notin A_k} w(v_i, v_j).$$
Minimizing the cut is a possible clustering measure, but for different reasons [25,1,26] it is interesting to introduce a normalization. The normalized cut of A is given by
$$ncut(A,W) := \sum_{k=1}^{K} \frac{cut(A_k,W)}{vol(A_k,W)},$$
where vol(A_k,W) = Σ_{i ∈ A_k} d(v_i,W) is the volume of the subset A_k ⊆ V. The normalized Laplacian matrix and the normalized cut are naturally related. To see this we need the following matrix representation H_A ∈ R^{|V|×K} [26,27,6] of A:
$$(H_A)_{i,j} := \begin{cases} \sqrt{\dfrac{d(v_i,W)}{vol(A_j,W)}} & \text{if } i \in A_j, \\ 0 & \text{otherwise}, \end{cases} \qquad (1)$$
and from the definition of volume it follows that H_A^T H_A = I. We reserve the symbol H for matrices that represent partitions, in the sense of (1). Using the same notation, the set of matrices representing partitions is H_{|V|,K}. Through straightforward linear algebra the normalized cut value can be written as the trace of a matrix product
$$ncut(A,W) = \mathrm{Tr}(H_A^T L_N(W) H_A). \qquad (2)$$
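The matrix representation (1) and the trace characterization (2) are easy to verify numerically. The sketch below (ours; the label encoding and function names are hypothetical) builds H_A from a vector of class labels and evaluates ncut(A,W) = Tr(H_A^T L_N(W) H_A).

```python
import numpy as np

def partition_matrix(labels, W, K):
    """H_A of Eq. (1): (H_A)_{i,j} = sqrt(d(v_i)/vol(A_j)) if v_i is in A_j, else 0."""
    d = W.sum(axis=1)
    H = np.zeros((W.shape[0], K))
    for j in range(K):
        members = labels == j
        H[members, j] = np.sqrt(d[members] / d[members].sum())   # vol(A_j) = sum of member degrees
    return H

def ncut(labels, W, K):
    """Normalized cut of the partition encoded by `labels`, via the trace identity (2)."""
    d = W.sum(axis=1)
    L = np.eye(len(W)) - (W / np.sqrt(d)[:, None]) / np.sqrt(d)[None, :]
    H = partition_matrix(labels, W, K)
    return float(np.trace(H.T @ L @ H))
```

By construction H_A^T H_A = I, so the trace above indeed equals the normalized cut of the labeled partition.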
With this characterization, the problem of minimizing the ncut of a graph G into K classes can be written in the following way:
$$NCut(G,K) := \min_{A \in P_{|V|,K}} ncut(A,W) = \min_{A \in P_{|V|,K}} \sum_{k=1}^{K} \frac{cut(A_k,W)}{vol(A_k,W)} = \min_{H_A \in H_{|V|,K}} \mathrm{Tr}(H_A^T L_N(W) H_A).$$
We call this problem (NCut); it is NP-complete [1], and this motivates the search for heuristic algorithms to approximate it. Interestingly, finding the first K eigenvectors of L_N(W) can be considered as a "relaxation" of problem (NCut) [1,26] and, indeed, this is the main motivation of spectral clustering algorithms.
Denoting by A* a minimizing partition of problem (NCut), by the Fan inequality [28] we have
$$NCut(G,K) = ncut(A^*,W) = \mathrm{Tr}(H_{A^*}^T L_N(W) H_{A^*}) \ge \sum_{i=1}^{K} \lambda_i.$$
Choosing the right number of classes K is often a difficult issue and is beyond the scope of this work. In the rest of this paper we consider the number of classes K known a priori.
3. Spectral clustering

Spectral clustering algorithms are based on a heuristic which, under certain conditions, gives approximate solutions to problem (NCut). We review this heuristic and discuss an instructive example.

3.1. Rounding

Even if minimizing the normalized cut is, in general, difficult, there are cases where it becomes very easy. For example, if an input graph G_0 consists of K connected components, then the columns of U_K are weighted indicator vectors for the K components and it is easy to find the correct solution. Results from matrix perturbation theory [26] say something more: even if we add noise to the weights of the input graph G_0 we can still recover the optimal solution H_{A*} from U_K with rounding techniques, provided the gap λ_{K+1} − λ_K is sufficiently large. In this sense we refer to the rounding problem [2], for the normalized cut, as follows. Let P ∈ R^{|V|×K}, P^T P = I,
$$Rounding(P,K) := \underset{H_A \in H_{|V|,K}}{\operatorname{argmax}} \|H_A^T P\|_F^2, \qquad (3)$$
where K is the target number of classes. An algorithm is a rounding technique if it solves or approximates problem (3); in that case we say that the algorithm rounds P. The statement of this problem appears in [10,2] and usually spectral clustering algorithms, like [4,5,23], are rounding techniques. The key observation is that the subspace spanned by the first K eigenvectors U_K of L_N(W) may approximate the subspace spanned by the columns of the optimal solution's matrix representation H_{A*}. In this case we want to round U_K. The following theorem holds.

Theorem 1 (Meilă [27] for the normalized cut). Let G = (V,E) be a graph with positive symmetric weights W. Let the normalized Laplacian diagonalization be L_N(W) = U Λ U^T and λ_{K+1} > λ_K. Let the columns of U be divided into two matrices U = [U_K, U_e], where U_K ∈ R^{|V|×K} is the matrix composed of the first K eigenvectors and U_e ∈ R^{|V|×(|V|−K)} the remaining ones. Let a partition A ∈ P_{|V|,K} of V with K classes be represented with matrix H_A ∈ R^{|V|×K}. Let H_A be written using the orthonormal basis of U:
$$[U_K, U_e]^T H_A = \begin{bmatrix} R_A \\ E_A \end{bmatrix}$$
with R_A ∈ R^{K×K} and E_A ∈ R^{(|V|−K)×K}. Let
$$\epsilon(K) := \frac{ncut(A,W) - \sum_{i=1}^{K} \lambda_i}{\lambda_{K+1} - \lambda_K},$$
then
$$\|E_A\|_F^2 \le \epsilon(K).$$

Remark 1. Since ||H_A||_F^2 = K, an equivalent result of Theorem 1 is ||R_A||_F^2 ≥ K − ε(K).

The two quantities ||R_A||_F^2 and ||E_A||_F^2 can be interpreted as the reconstruction and the residual of partition A. Theorem 1 was used in [29], and for K-means in [27], to motivate the rounding problem (3) with P = U_K when the gap λ_{K+1} − λ_K is large and the numerator ncut(A,W) − Σ_{i=1}^{K} λ_i is small. In many cases, however, the first K eigenvectors of L_N(W) are not representative of an optimal partition. Consequently, it may happen that the gap λ_{K+1} − λ_K is not significant with respect to noise, or that the minimum normalized cut value is not close to its bound (equivalently, the first K harmonic eigenvectors D^{-1/2}u_i are not piecewise constant). It is possible to build graphs [11] where the second eigenvector represents a bipartition with a much worse cut value than the optimal one. The examples of [11] refer to the combinatorial Laplacian and the ratio cut, but in the Appendix we discuss a straightforward modification to the normalized Laplacian and the normalized cut. The examples in [11] do not seem to be related to typical inputs in pattern recognition problems, in which graphs arise from the comparison of points embedded in a Euclidean space. In the next section we show through an example that similar problems may arise in standard settings.

3.2. A toy example

Here we present an example where the gap λ_{K+1} − λ_K is different from zero but sufficiently small to cause the instability of the corresponding eigenvectors. We consider two rectangles R_1 and R_2 and a segment P_3 that we define as
R_1 := {(x,y) ∈ R^2 | 0 ≤ x ≤ 0.5, 0 ≤ y ≤ 1},
R_2 := {(x,y) ∈ R^2 | 0.65 ≤ x ≤ 1.15, 0 ≤ y ≤ 1},
P_3 := {(x,y) ∈ R^2 | 1.25 ≤ x ≤ 3.54, y = 1.03}.
We sampled uniformly 100 points within R_1, 100 within R_2 and 150 within P_3 and, for clarity of exposition, we order them in ascending order with respect to the x-axis. The sampled dataset is shown in Fig. 1(a). We use the Gaussian similarity between pairs of points x_i, x_j: w(v_i,v_j) := exp(−||x_i − x_j||^2 / 0.22^2). We choose as reference partition A the one that divides the sampled points according to the three regions R_1, R_2, P_3. The starting question we investigate is whether rounding the first three eigenvectors gives a partition close to A or, alternatively, a better partition with lower normalized cut value.
First, we notice that the gap λ_4 − λ_3 is small and Theorem 1 does not give guarantees to spectral clustering. The spectral algorithm of [5], which roughly corresponds to rounding U_3, gives the non-optimal partition B of Fig. 2(a), with normalized cut 0.16. There, the two rectangles are placed in the same cluster and the segment is divided into two clusters. On the other side, the same algorithm using the spectral embedding [u_1, u_2, u_4] in place of U_3 gives the reference partition. This unexpected result is due to the swapping between the third and the fourth eigenvectors, which is sensitive to small perturbations of the corresponding eigenvalues. Other datasets, sampled from the same distribution, had these eigenvectors in inverted positions. Similar behavior can be expected in the presence of symmetries in the data, which may induce degenerate eigenvalues or very small gaps. We can visually inspect the two embeddings D^{-1/2}[u_2, u_3] and D^{-1/2}[u_2, u_4] in Figs. 3(a) and (b).¹ Using the same symbols and colors of Fig. 1(b), it is clear that in the first embedding D^{-1/2}[u_2, u_3], Fig. 3(a), the two rectangles overlap forming one cluster; on the contrary, in the second embedding D^{-1/2}[u_2, u_4], Fig. 3(b), they can be separated.

¹ We ignore the harmonic eigenvector D^{-1/2}u_1 because it is constant.
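The following sketch (ours, not the authors' code) reproduces the setting of this toy example under the stated sampling assumptions: it samples the three regions, builds the Gaussian similarity with σ = 0.22 and exposes the two competing three-dimensional embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# 100 points in each rectangle, 150 on the segment
R1 = np.column_stack([rng.uniform(0.0, 0.5, 100), rng.uniform(0.0, 1.0, 100)])
R2 = np.column_stack([rng.uniform(0.65, 1.15, 100), rng.uniform(0.0, 1.0, 100)])
P3 = np.column_stack([rng.uniform(1.25, 3.54, 150), np.full(150, 1.03)])
X = np.vstack([R1, R2, P3])

# Gaussian similarity w(v_i, v_j) = exp(-||x_i - x_j||^2 / 0.22^2)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-sq_dists / 0.22 ** 2)

# Normalized Laplacian spectrum
d = W.sum(axis=1)
L = np.eye(len(W)) - (W / np.sqrt(d)[:, None]) / np.sqrt(d)[None, :]
lam, U = np.linalg.eigh(L)

print("gap l4 - l3:", lam[3] - lam[2])   # small gap: u3 and u4 are easily swapped
emb_standard = U[:, :3]                  # [u1, u2, u3], what rounding U_3 uses
emb_swapped = U[:, [0, 1, 3]]            # [u1, u2, u4], the embedding discussed in the text
```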
[Figure 1 about here]
Fig. 1. A segment in the data may hide the optimal solutions. (a) Data to be clustered and (b) the reference solution A (ncut 0.09). The similarity between points x_i, x_j is w(v_i,v_j) := exp(−||x_i − x_j||^2 / 0.22^2).
It is natural to ask if, in this case, the reference partition A is a minimizer of the normalized cut and where, among the eigenvectors, we should look for a possibly better partition. The bound we introduce in Section 4 aims at bounding the number of candidate eigenvectors.
3.3. Choosing σ in the toy example

It is an open and difficult problem in spectral clustering to choose a meaningful σ when using the Gaussian similarity. Theoretical work in [30] shows a bias-variance trade-off due to the sampling process and noise on the manifold underlying the data. In the toy example we chose, by visual inspection of the eigenvalues, σ = 0.22. Since there is a margin between the distributions, there exist smaller σ's such that the three distributions are almost disconnected and can be recovered using spectral clustering, but in such a case the whole graph is poorly connected, with many eigenvalues concentrated around 0, a situation that we do not want to encounter in practice.
[Figure 2 about here]
Fig. 2. Other two clusterings of the dataset in Fig. 1(a). (a) Solution given by the Ng–Jordan–Weiss algorithm [5] (ncut 0.16) and (b) clustering solution of K-means, 10^4 repetitions, on the embedding given by the first nine harmonic eigenvectors D^{-1/2}U_9 (ncut 0.59).
We also note that by joining the two regions R_2 and P_3 the same eigenvector swap occurs even with small σ's, although with a much smaller difference between partitions A and B. Perhaps unexpectedly, choosing a larger σ, σ = 1, gives results comparable to σ = 0.22. Other approaches [23] propose to use a different σ_i for each point x_i, depending on a simple data statistic, like the distance of the i-th point from its k-th nearest neighbor. In the toy example we tried k = 7, 30, but in both cases we have the undesired swapping of the eigenvectors.
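For completeness, here is a sketch (ours) of the per-point scaling mentioned above, in the spirit of self-tuning spectral clustering [23]: each σ_i is the distance of x_i from its k-th nearest neighbour and the similarity becomes exp(−||x_i − x_j||^2 / (σ_i σ_j)).

```python
import numpy as np

def self_tuning_similarity(X, k=7):
    """Gaussian similarity with local scaling sigma_i = distance to the k-th nearest neighbour."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    dists = np.sqrt(sq_dists)
    sigma = np.sort(dists, axis=1)[:, k]          # column 0 is the point itself (distance 0)
    W = np.exp(-sq_dists / np.outer(sigma, sigma))
    np.fill_diagonal(W, 0.0)
    return W
```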
[Figure 3 about here]
Fig. 3. (a) Spectral embedding given by D^{-1/2}[u_2, u_3] and (b) spectral embedding given by D^{-1/2}[u_2, u_4]. Symbols and colors refer to the reference partition A of Fig. 1(b).

4. An extended gap

We generalize Theorem 1 by introducing the notion of extended gap λ_{M+1} − λ_K, where M ≥ K is a manually chosen parameter. Parameter M gives the number of eigenvectors that should be chosen for the spectral embedding and the extent to which this embedding is satisfactory. The following theorem holds:

Theorem 2 (Extended gap bound). Let G = (V,E) be a graph with positive symmetric weights W. Let the normalized Laplacian diagonalization be L_N(W) = U Λ U^T and let M ≥ K be a parameter such that λ_{M+1} > λ_K. Let the columns of U be divided into two matrices U = [U_M, U_e], where U_M ∈ R^{|V|×M} is the matrix composed of the first M eigenvectors and U_e ∈ R^{|V|×(|V|−M)} the remaining ones. Let a partition A ∈ P_{|V|,K} of V with K classes be represented with matrix H_A ∈ R^{|V|×K}. Let H_A be written using the orthonormal basis of U:
$$[U_M, U_e]^T H_A = \begin{bmatrix} R_A \\ E_A \end{bmatrix}$$
with R_A ∈ R^{M×K} and E_A ∈ R^{(|V|−M)×K}. Let
$$\epsilon(M) := \frac{ncut(A,W) - \sum_{i=1}^{K} \lambda_i}{\lambda_{M+1} - \lambda_K}, \qquad (4)$$
then
$$\|E_A\|_F^2 \le \epsilon(M). \qquad (5)$$

Remark 2. Since ||H_A||_F^2 = K, an equivalent result of Theorem 2 is ||R_A||_F^2 = ||H_A^T U_M||_F^2 ≥ K − ε(M).

See Appendix for a proof.

Theorem 2 bounds, through the quantity ε(M), the residual of a partition A, which we would like to be small enough to allow spectral clustering to recover the solution. We refer to Theorem 2 as the extended gap bound. A trivial choice would be M = |V|, as suggested in [12,31], but it is possible that a much smaller number of eigenvectors suffices, reducing the computational complexity of the problem. Thus the extended gap bound may guide us in choosing such a number M. However, the exact choice of M depends on the problem at hand, the algorithmic requirements and the computational resources. We note that the extended gap bound is quite general: even if [32] gives a graph on which it is tight, there are other graphs, like those of [11], on which it is uninformative. For practical purposes it would be interesting to specialize it to different graph classes. In the experiments, with K = 2, 3, we manually set M = 10, obtaining very good results. Another general observation, pertaining to our toy example, is that a large spectral gap λ_{M+1} − λ_K avoids the possible instabilities related to a small gap λ_{K+1} − λ_K, thus leading to more stable and reproducible results.

4.1. Applying the extended gap to the toy example

In the example of Section 3.2 the gap λ_4 − λ_3 is small, but the extended gap λ_{10} − λ_3, for example, is large and indeed we can use the extended gap bound with M = 9 and the normalized cut value 0.09 of the reference partition A. Thus by Remark 2 we obtain a very large reconstruction, ||H_{A*}^T U_9||_F^2 ≥ 3 − ε(9) = 2.85, of a partition A* that minimizes problem (NCut). In this way we can essentially find in the first nine eigenvectors the representation of those partitions with a normalized cut equal to or smaller than 0.09. For example, four eigenvectors appear to be sufficient for the reference partition A (Fig. 4).

[Figure 4 about here]
Fig. 4. (a) First 10 eigenvalues of the normalized Laplacian L_N(W) of the dataset in Fig. 1(a). The gap λ_4 − λ_3 is not informative for the presence of three clusters, whereas the extended gap λ_{10} − λ_3 is more significant. (b) Behavior of the estimated ε(M) of the extended gap bound (upper curve) and of ||E_A||_F^2 (lower curve) as a function of M, with problem (NCut) estimated with the cut of the reference partition A, 0.09.
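In practice, the extended gap bound can be evaluated for several candidate values of M at once. The sketch below (ours; `ncut_value` is the normalized cut of any available candidate partition) computes ε(M) of Eq. (4) from the eigenvalues of L_N(W).

```python
import numpy as np

def extended_gap_residual(ncut_value, lam, K, M_values):
    """epsilon(M) = (ncut(A,W) - sum_{i<=K} lambda_i) / (lambda_{M+1} - lambda_K), Eq. (4).

    `lam` holds the eigenvalues of L_N(W) in ascending order (lam[0] = 0)."""
    numerator = ncut_value - lam[:K].sum()
    eps = {}
    for M in M_values:
        gap = lam[M] - lam[K - 1]                 # lambda_{M+1} - lambda_K with 0-based indexing
        eps[M] = numerator / gap if gap > 0 else np.inf
    return eps
```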
5. A probabilistic algorithm

In this section we analyse a probabilistic algorithm that rounds the first M ≥ K eigenvectors into K classes.

5.1. The algorithm

In Section 4 we saw that choosing M ≥ K normalized Laplacian eigenvectors is appropriate, but it is not clear whether the rounding problem (3) can be formulated with M > K eigenvectors. A reasonable approach is to reduce to the rounding problem (3) by removing an (M − K)-dimensional linear subspace and then applying a rounding technique to the remaining K-dimensional subspace. This can be done by excluding single eigenvectors, as in [20,21], or by excluding linear combinations of them. We take the latter approach and we exclude randomly chosen (M − K)-dimensional linear subspaces from the M eigenvectors. We do this for two reasons: firstly, it is possible that all of the first M eigenvectors contribute to the final solution; secondly, each eigenvector may contribute very differently to the final solution.
We are given a connected graph G and we want to find a partition A that solves problem (NCut). Briefly stated, the algorithm we propose tries to directly sample the matrix H_A ∈ R^{|V|×K}. In practice we can rely on sampling a much smaller matrix P_Q ∈ R^{(M−1)×(K−1)}, P_Q^T P_Q = I, that arises from rewriting a solution matrix H_{A*} in terms of the eigenvectors of L_N(W). Details are in the Appendix. The algorithm, whose pseudocode is shown in Algorithm 1, takes as input W, a number of target clusters K and a number M of eigenvectors to be considered. The first M eigenvectors and eigenvalues of L_N(W) are computed, then the normalized cut value ncut(Rounding([u_1, U_{2:M} P_N], K)) is evaluated with max_rep random subspaces P_N ∈ R^{(M−1)×(K−1)}, P_N^T P_N = I. The partition with minimum value is returned as the solution.

Algorithm 1. Rounding into K classes a random combination of the first M eigenvectors of L_N(W).
Input: the matrix W of a weighted graph G, the number M of eigenvectors to consider, the number of clusters K, an upper bound EMP_NCUT_UB on problem (NCut).
Parameters: Rounding, a rounding algorithm.
Output: a partition of the graph into (A_1, ..., A_K) such that ∪_{i=1,...,K} A_i = V and A_i ∩ A_j = ∅ for i ≠ j.
1.  Let L_N(W) = I − D^{−1/2} W D^{−1/2} = U Λ U^T
2.  Let ε̃(M) be an estimate of ε(M) as in Theorem 4
3.  for k = 1:max_rep
4.    Let N ∈ R^{(M−1)×(K−1)} be a matrix whose entries are sampled i.i.d. according to N(0,1)
5.    Let P_N S_N V_N^T be the thin SVD of N (thus P_N ∈ R^{(M−1)×(K−1)})
6.    if Tr(P_N^T Λ_{2:M} P_N) ≤ EMP_NCUT_UB + ε̃(M) λ_M
7.      A(k) = Rounding(U_{2:M} P_N, K)
8.      Let ncut_val_k be ncut(A(k), W)
9.    else
10.     Let ncut_val_k be ∞
11.   end
12. Let MinPos be the position of the minimum of ncut_val
13. return A(MinPos)

In order to sample matrix P_N we first sample a Gaussian random matrix, a matrix whose entries are generated by independent normally distributed Gaussian random processes, and then we find a basis for the subspace spanned by its columns, for example with the thin SVD [33] procedure. We know that P_N is uniformly sampled from the (M−1)×(K−1) matrices with orthonormal columns, see [34] for details. Apart from step 6, the other steps of the algorithm are self-explanatory. Step 6 is useful to exploit the properties of the matrix P_Q in order to save computational time. In particular, we can expand the trace characterization of the normalized cut and obtain the following inequality:
$$\mathrm{Tr}(P_Q^T \Lambda_{2:M} P_Q) \le NCut(G,K) + \epsilon(M)\,\lambda_M. \qquad (6)$$
Before running the algorithm we compute an upper bound EMP_NCUT_UB on problem (NCut) with a heuristic algorithm. This upper bound is used to consider only those matrices P_N for which (6) holds (with P_N in place of P_Q). The computational advantage is that checking this condition is faster than computing the input of the rounding step.
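A minimal Python sketch (ours) of the sampling loop of Algorithm 1. It assumes user-supplied `rounding` and `ncut` routines (for instance K-means on the rows of the projected embedding and the trace identity (2)), replaces the thin SVD of step 5 with a thin QR factorization of the Gaussian matrix, and omits the step-6 pruning for brevity.

```python
import numpy as np

def random_subspace_rounding(W, K, M, rounding, ncut, max_rep=1000, seed=0):
    """Round random (K-1)-dimensional combinations of the eigenvectors u_2,...,u_M of L_N(W)."""
    rng = np.random.default_rng(seed)
    d = W.sum(axis=1)
    L = np.eye(len(W)) - (W / np.sqrt(d)[:, None]) / np.sqrt(d)[None, :]
    lam, U = np.linalg.eigh(L)
    best_labels, best_value = None, np.inf
    for _ in range(max_rep):
        N = rng.standard_normal((M - 1, K - 1))      # Gaussian random matrix
        P_N, _ = np.linalg.qr(N)                     # orthonormal basis of span(N)
        embedding = np.column_stack([U[:, 0], U[:, 1:M] @ P_N])
        labels = rounding(embedding, K)              # user-supplied rounding technique
        value = ncut(labels, W, K)
        if value < best_value:
            best_labels, best_value = labels, value
    return best_labels, best_value
```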
5.2. Analysis

The probabilistic algorithm returns a solution A* of problem (NCut) if any of the sampled matrices is such that the rounding step gives H_{A*}. The set of matrices that lead to optimal results is not necessarily small, but here we want to analyse those matrices that are close to P_Q, the matrix of the previous section directly related to A*. Following the result of [34], we measure the distance between two matrices P_B, P_C ∈ R^{(M−1)×(K−1)}, P_B^T P_B = I, P_C^T P_C = I, using canonical angles. Since matrices like P_B have orthonormal columns that span a linear subspace, we refer to those matrices as subspaces. The cosines of the canonical angles θ_1, ..., θ_{K−1} ∈ [0, π/2] between P_B and P_C are the singular values of P_B^T P_C: cos(θ_m) = σ_m(P_B^T P_C), where we consider the singular values ordered in descending order, σ_i ≥ σ_{i+1}. Given P_B, P_C, the probability distribution of the largest canonical angle, whose cosine is the smallest singular value of P_B^T P_C, involves the gamma function
$$\Gamma(z) = \int_0^{\infty} t^{z-1} e^{-t}\, dt$$
and the hypergeometric function of a matrix argument ₂F₁ [34]. The following theorem holds:

Theorem 3 (Absil et al. [34]). The probability that the smallest singular value σ_min(P_B^T P_C) of two random subspaces P_B, P_C ∈ R^{(M−1)×(K−1)}, P_B^T P_B = I, P_C^T P_C = I, is greater than s ∈ [0,1], if K < M/2 + 1, is
$$P(\sigma_{\min}(P_B^T P_C) \ge s) = \frac{\Gamma\!\left(\tfrac{K}{2}\right)\Gamma\!\left(\tfrac{M-K}{2}+1\right)}{\Gamma\!\left(\tfrac{1}{2}\right)\Gamma\!\left(\tfrac{M}{2}\right)}\, \big(\sin(\arccos(s))\big)^{(K-1)(M-K)}\; {}_2F_1\!\left(\tfrac{M-K}{2}, \tfrac{1}{2}; \tfrac{M}{2}; \sin^2(\arccos(s))\, I_{K-1}\right).$$

We denote the value of P(σ_min(P_B^T P_C) ≥ s) by ρ(s, M−1, K−1). Using this probability we can find an explicit lower bound on the probability of approximating the correct solution. The following theorem holds:

Theorem 4. Let G = (V,E) be a graph with positive symmetric weights W. Let U_M ∈ R^{|V|×M} be the matrix composed of the first M > K eigenvectors of L_N(W) with λ_{M+1} > λ_K. Let a partition A ∈ P_{|V|,K} be a minimizer of problem (NCut) and be represented with matrix H_A ∈ R^{|V|×K}. Let ε(M) be as in the extended gap bound, with ε(M) ∈ [0,1]. Let P_N ∈ R^{(M−1)×(K−1)}, P_N^T P_N = I, be a random subspace and δ ∈ [0,1]. Then
$$P\!\left(\|H_A^T [u_1, U_{2:M} P_N]\|_F^2 \ge (K-1)(1-\delta)(1-\epsilon(M)) + 1\right) \ge \rho\!\left(\sqrt{1-\delta},\, M-1,\, \min(K-1, M-K)\right).$$
The proof is given in the Appendix. From Theorem 4 we see that both quantities δ and ε(M) determine the precision we are working with. A high precision requires small δ and ε(M) (equivalently, a large value of M). Clearly, for small values of δ or large values of M the probability ρ(√(1−δ), M−1, min(K−1, M−K)) decreases. Thus, for practical computational reasons, we need to choose a trade-off between high precision of the algorithm and a small value of M. Explicit computation of ρ can be done following [34,35]. We end this analysis by remarking that, as usual with probabilistic algorithms, the probability of finding H_A increases with the number of repetitions max_rep.

Remark 3. Under the conditions of Theorem 4 we can upper bound the probability that the probabilistic algorithm fails to sample an approximation of an optimal solution with max_rep extractions, where 𝒫 is the set of sampled subspaces:
$$P\!\left(\forall P_N \in \mathcal{P}:\ \|H_A^T [u_1, U_{2:M} P_N]\|_F^2 \le (K-1)(1-\delta)(1-\epsilon(M)) + 1\right) \le \left(1 - \rho\!\left(\sqrt{1-\delta},\, M-1,\, \min(K-1, M-K)\right)\right)^{max\_rep}.$$

[Figure 5 about here]
Fig. 5. Running NJW with eight classes on the toy example gives a clustering which refines both partitions A and B.
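Besides the exact evaluation of ρ following [34,35], the probability in Theorem 3 can be approximated by plain Monte Carlo simulation, as in this sketch (ours): sample pairs of random subspaces and count how often the smallest singular value of P_B^T P_C exceeds s.

```python
import numpy as np

def rho_monte_carlo(s, M, K, n_trials=10000, seed=0):
    """Estimate P(sigma_min(P_B^T P_C) >= s) for random (K-1)-dimensional subspaces of R^{M-1}."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_trials):
        P_B, _ = np.linalg.qr(rng.standard_normal((M - 1, K - 1)))
        P_C, _ = np.linalg.qr(rng.standard_normal((M - 1, K - 1)))
        smin = np.linalg.svd(P_B.T @ P_C, compute_uv=False).min()
        hits += smin >= s
    return hits / n_trials
```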
6. Experimental results

In this section we first revisit the example of Section 3.2 and then we test the probabilistic algorithm on both random graphs and real world graphs from the Graph Partitioning Archive² with K = 2.

6.1. Tests on the toy example

In Section 4.1 we observed that the first nine eigenvectors should be sufficient to recover the solution of problem (NCut) with three classes on the toy example. We applied the probabilistic algorithm with max_rep = 10^4 and M = 9, obtaining the reference partition A as solution, thus avoiding the instability due to the small gap between consecutive eigenvalues.
A possible alternative approach to find the optimal solution would be to run K-means, with K = 3, on the dataset's embedding given by D^{-1/2}U_9. The result is the non-optimal partition C of Fig. 2(b), which has a normalized cut value of 0.59. However, from the K-means point of view, limited to this nine-dimensional spectral embedding, partition C is better than both the reference partition A and the partition B of Section 3.2. Another important observation is that, if in this case we limit problem (NCut) to use only the first nine eigenvectors and eigenvalues, partition B is better than partition A: technically, Tr(H_B^T U_9 Λ_9 U_9^T H_B) < Tr(H_A^T U_9 Λ_9 U_9^T H_A). This means that the spectral embedding may not preserve the minimizing solution of problem (NCut). To invert the direction of the inequality and preserve the correct solution, the spectral embedding would need at least 15 eigenvectors.
Finally, we notice a relatively large gap between the eighth and the ninth eigenvalue. Correspondingly, standard spectral clustering with K = 8, applied to the first eight eigenvectors (D^{-1/2}U_8), leads to an oversegmentation (see Fig. 5) which refines the reference partition A. By looking at the extended gap we were able to use the proposed method directly with K = 3, thus avoiding the oversegmentation.

6.2. Tests on random graphs

Graphs from G_{n,p} are symmetric graphs with n vertices where each pair of vertices is connected with probability p. We also consider graphs from G_{n,p} where each sampled edge has a weight uniformly and i.i.d. sampled from [0,1].
² http://staffweb.cms.gre.ac.uk/~wc06/partition/.
Table 1
Results for K = 3 on 100 different connected random graphs; the maximum number of repetitions max_rep is 2000. In columns "Alg.1+loc." and "ref.+loc." we report the mean value of the normalized cut for the probabilistic algorithm and for the reference solution. In column "MD" we report the mean misclassification distance between the probabilistic algorithm's solution and the reference solution. In column "Improvement" we report the number of times the probabilistic algorithm gave a value lower than the reference solution, and in column "ref." we report the reference solution value without the greedy improvement step.

Graph model              | Alg.1+loc. | ref.+loc. | MD   | Improvement | ref.
P(W_{i,j} ∈ [0,1]) = 5%  | 1.21       | 1.23      | 0.40 | 95          | 1.31
P(W_{i,j} ∈ [0,1]) = 10% | 1.46       | 1.47      | 0.43 | 100         | 1.54
P(W_{i,j} = 1) = 5%      | 1.32       | 1.34      | 0.42 | 90          | 1.42
P(W_{i,j} = 1) = 10%     | 1.53       | 1.54      | 0.42 | 95          | 1.61
We sampled 100 connected graphs with n = |V| = 300 vertices and p = 5%, 10%. We want a partition with small normalized cut for these graphs with K = 3 classes. To compare two partitions we use the misclassification distance [36], which consists in finding a permutation of the labels that maximizes the intersection between the partitions' elements:
$$MD(B,C) := \min_{p \in Perm(K)}\ 1 - \frac{1}{|V|} \sum_{i=1}^{K} |B_i \cap C_{p(i)}|,$$
where Perm(K) is the set of permutations of K elements. Since in this case the bound of the extended gap suggests too large values for M, for computational reasons we fix this value to 7. A greedy local search, implemented as a variant of Fiduccia–Mattheyses [37], has been applied to the output cuts after the rounding step, leading to a significant improvement. As reference result we consider the optimal value obtained by one of [4,5]. We display the obtained results in Table 1. In column "Alg.1+loc." we report the normalized cut values found by the probabilistic algorithm with max_rep = 2000 repetitions. In column "ref.+loc." we report the results of the reference algorithm when the solution is refined with the local search. By comparing the values of the two columns "ref." and "ref.+loc.", it is apparent that the local search improves the results substantially. Since random graphs usually do not have a block structure, the average normalized cut found by the probabilistic algorithm is not significantly smaller than the one found by the reference algorithm. However, it is important to note that the partitions found are substantially different from the ones given by the reference algorithm (on average MD = 41%, see column "MD").
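The misclassification distance can be computed without enumerating all K! permutations: maximizing the overlap is an assignment problem on the K×K contingency table. A sketch (ours) using SciPy's Hungarian solver:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def misclassification_distance(labels_b, labels_c, K):
    """MD(B,C) = 1 - (1/|V|) max_p sum_i |B_i ∩ C_{p(i)}|, via optimal assignment."""
    contingency = np.zeros((K, K))
    for b, c in zip(labels_b, labels_c):
        contingency[b, c] += 1
    rows, cols = linear_sum_assignment(-contingency)     # maximize total overlap
    return 1.0 - contingency[rows, cols].sum() / len(labels_b)
```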
Table 2
Results for the balanced cut on benchmark graphs from the Graph Partitioning Archive, two classes. We repeated 10^2 times Alg. 1 with max_rep = 10^4 and M = 10. We report the minimum, maximum and mean balanced cut value achieved. In column "FMC" we report the results of the Fiedler median cut, which considers the second eigenvector. In column "Bench" we report the value of the benchmark partition. Column "MD" is the misclassification distance between the benchmark and the minimum value among the 10^2 repetitions of Alg. 1. For graphs marked with * different minimal solutions were found; in that case we report the farthest distance. Benchmark results are considered for 0% imbalance on 04/03/2010.

Graph     | Vertices | Edges   | Min   | Max   | Mean  | FMC   | Bench | MD
data      | 2851     | 15093   | 189   | 204   | 197   | 260   | 189   | 0.0%
3elt      | 4720     | 13722   | 92    | 112   | 100   | 117   | 90    | 0.7%*
uk        | 4824     | 6837    | 21    | 28    | 26    | 31    | 20    | 0.2%
add32     | 4960     | 9462    | 11    | 21    | 15    | 23    | 11    | 6.3%
whitaker3 | 9800     | 28989   | 132   | 153   | 138   | 136   | 127   | 3.8%
crack     | 10240    | 30380   | 190   | 215   | 203   | 233   | 184   | 0.8%
fe_4elt2  | 11143    | 32818   | 130   | 156   | 131   | 130   | 130   | 1.0%*
bcsstk29  | 13992    | 302748  | 2863  | 3851  | 3134  | 2978  | 2843  | 2.5%
4elt      | 15606    | 45878   | 145   | 188   | 162   | 194   | 139   | 0.9%
cti       | 16840    | 48232   | 428   | 537   | 481   | 450   | 334   | 6.1%
bcsstk30  | 28924    | 1007284 | 6408  | 6819  | 6475  | 6620  | 6394  | 0.9%
bcsstk32  | 44609    | 985046  | 5499  | 6803  | 6201  | 7777  | 4667  | 0.9%
t60k      | 60005    | 89440   | 91    | 112   | 99    | 120   | 79    | 3.5%
brack2    | 62631    | 366559  | 749   | 1197  | 896   | 830   | 731   | 0.1%
finan512  | 74752    | 261120  | 162   | 162   | 162   | 498   | 162   | 50.0%*
fe_rotor  | 99617    | 662431  | 2323  | 3581  | 2695  | 2420  | 2098  | 1.1%
598A      | 110971   | 741934  | 2471  | 3884  | 2623  | 2502  | 2398  | 0.6%
fe_ocean  | 143437   | 409593  | 548   | 784   | 608   | 647   | 464   | 2.3%
m14b      | 214765   | 1679018 | 3993  | 6425  | 4304  | 4079  | 3836  | 0.2%
auto      | 448695   | 3314611 | 10778 | 12654 | 11284 | 13110 | 10117 | 0.9%
This motivates the use of the probabilistic algorithm when, as in clustering, we are interested in the optimal partition but usually not in the optimal value itself. In column "Improvement" we show the number of times, out of 100 repetitions, in which the value found by the probabilistic algorithm is strictly smaller than the one of the reference result. From the very high percentage reported, ≥ 90%, we conclude that the probabilistic algorithm consistently provides improved solutions.

6.3. Tests on the Graph Partitioning Archive

The Graph Partitioning Archive is a repository of graphs belonging to real world applications and constitutes an important benchmark for graph-cutting algorithms. It was created as a follow-up of [38]. We focus on finding partitions with minimal cuts whose parts have the same number of vertices, so that the cardinality of each partition's element |A_j| is ≤ ⌈|V|/K⌉. With this constraint the problem that we have to solve is the Balanced Cut with K classes:
$$BalancedCut(G,K) := \min_{\substack{A \in P_{|V|,K} \\ \forall i:\ |A_i| \le \lceil |V|/K \rceil}} \sum_{i=1}^{K} cut(A_i, W).$$
It is possible to specialize the probabilistic algorithm for solving this problem by using the combinatorial Laplacian instead of the normalized Laplacian. The combinatorial Laplacian of a graph G is L_C(W) = D − W. In this case the matrix representation H_A of a partition A changes so that (H_A)_{i,j} = 1/√|A_j| if i ∈ A_j and 0 otherwise. Even if the natural functional associated with the combinatorial Laplacian is the ratio cut, partitions minimizing the ratio cut might not have equal cardinalities. However, the solutions of problem (BalancedCut) might have a small ratio cut too. Thus, for example, it is still meaningful to use the extended gap for bounding the number of eigenvectors M for the solution of problem (BalancedCut). Minimizing the ratio cut and problem (BalancedCut) are both NP-complete problems [39]. For these experiments we chose K = 2. As rounding algorithm we compare each entry of a vector with its median value. We do not apply, as in the previous section, any optimizing procedure to the rounding's output.
Unfortunately the bound of the extended gap, for all graphs of the repository, gives values of M that are not computationally tractable, so for practical reasons we fixed the value to 10. Thus, we consider only those graphs of the archive where we expect to find the current optimal solution in the first 10 eigenvectors, that is, those for which the reconstruction of B is ||U_{10}^T H_B||_F^2 ≥ 1.85. These are 20 graphs. For each graph we compute the first 10 eigenvectors of L_C with the MATLAB function eigs and then we run 100 repetitions of the algorithm with max_rep = 10^4. The average running time of the algorithm has been very different from graph to graph, and was about 2 h per graph, using machines with Intel Core Duo processors at 2.33 GHz. As reference algorithm we consider the Fiedler median cut [40], which consists in rounding the second eigenvector of L_C. In Table 2 we show the results for the selected graphs. We report the minimum, maximum and mean value of the 100 repetitions. In four cases the minimum value matched the benchmark and in other 11 cases it is within 5% of its value. In all cases the Fiedler median cut value, which we report in column FMC, has been substantially improved by the algorithm. In column MD we report the misclassification distance of the minimum solution with respect to the benchmark. Some results deserve special attention. Graphs fe_4elt2 and finan512 seem to be easier than the others. For some graphs, marked with * in column "MD", we obtained different partitions with the same minimum cut value. This means that different good solutions are available. For graphs cti, bcsstk32, t60k, fe_rotor, fe_ocean the minimum and mean values are not particularly good; in those cases the number of repetitions is not sufficient or the number of chosen eigenvectors should be greater than 10. In general, the optimal solutions use many of the 10 eigenvectors, with very different weights each.
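For reference, a sketch (ours) of the Fiedler median cut baseline reported in column FMC: threshold the second eigenvector of the combinatorial Laplacian L_C(W) = D − W at its median, which yields two classes of (nearly) equal cardinality.

```python
import numpy as np

def fiedler_median_cut(W):
    """Bipartition by thresholding the Fiedler vector of L_C(W) = D - W at its median."""
    d = W.sum(axis=1)
    L_C = np.diag(d) - W
    _, U = np.linalg.eigh(L_C)
    fiedler = U[:, 1]                        # second eigenvector (Fiedler vector)
    return (fiedler > np.median(fiedler)).astype(int)
```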
7. Conclusions

In this paper we analyse the use of more than K eigenvectors of the normalized Laplacian for spectral clustering with K classes. The standard heuristic for spectral clustering relies on the existence of a sufficiently large gap between the K-th and the (K+1)-th eigenvalue.
We present a bound for choosing M ≥ K normalized Laplacian eigenvectors, where K is the target number of classes. This bound is based upon the gap between the (M+1)-th and the K-th eigenvalue and ensures stability of the solution. We also proposed a probabilistic algorithm that generalizes eigenvector selection, because it accounts for situations where the solution partition is represented by a linear combination of all the first M eigenvectors, with each eigenvector contributing differently. The algorithm gives good results, often close or equal to the state of the art, on benchmark graphs for bipartitioning problems.
It is worth noticing that in our experiments the number of repetitions max_rep was much smaller than that given by our analysis, which is based upon standard results of matrix theory. Despite this, the obtained results are very good. A possible explanation is that the set of subspaces leading to good solutions is much larger than the one we studied. It would be interesting to characterize this larger set which, however, seems to be highly dependent on the problem at hand. On the other hand, it is desirable, for practical purposes, to find some modifications of the probabilistic algorithm that reduce the number of required repetitions. Another important area of work is to specialize the bound of the extended gap on different classes of graphs and noise conditions on the edge weights.
Acknowledgments

We thank Paolo Albini, Matteo Santoro and Veronica Umanita for useful remarks on a draft of this paper. We thank Massimiliano Pontil and Lorenzo Rosasco for helpful discussions. We are grateful to Marco Ferrante for help with the OurGrid middleware. We are grateful to the reviewers for their comments on the original submission. This work has been partially supported by the EU Integrated Project Health-e-Child IST-2004-027749.

Appendix A

A.1. Proof of Theorems 1 and 2

Using the notation of Sections 2.1 and 4 and denoting the j-th column of H_A with h_j, we have that
$$ncut(A,W) = \mathrm{tr}(H_A^T L_N(W) H_A) = \mathrm{tr}(H_A^T U \Lambda U^T H_A) = \sum_{i=1}^{|V|} \lambda_i \Big( \sum_{j=1}^{K} \langle h_j, u_i \rangle^2 \Big), \qquad H_A = [U_M\ U_e] \begin{bmatrix} R_A \\ E_A \end{bmatrix}.$$

We recall some facts:

(i) H_A ∈ R^{|V|×K}, H_A^T H_A = I;
(ii) U ∈ R^{|V|×|V|} is an orthonormal basis for R^{|V|}, thus UU^T = U^T U = I, and Σ_{j=1}^{K} ⟨h_j, u_1⟩^2 = 1;
(iii) by (i) and (ii) it follows that, for all i, Σ_{j=1}^{K} ⟨h_j, u_i⟩^2 ≤ 1 (this implies Σ_{i=2}^{J} Σ_{j=1}^{K} ⟨h_j, u_i⟩^2 ≤ J − 1);
(iv) by (i) and (ii) it follows that, for all j, Σ_{i=1}^{|V|} ⟨h_j, u_i⟩^2 = 1 (this implies Σ_{i=2}^{|V|} Σ_{j=1}^{K} ⟨h_j, u_i⟩^2 = K − 1);
(v) by (iii) it follows that J ≤ K implies Σ_{i=J+1}^{|V|} Σ_{j=1}^{K} ⟨h_j, u_i⟩^2 = Σ_{i=2}^{|V|} Σ_{j=1}^{K} ⟨h_j, u_i⟩^2 − Σ_{i=2}^{J} Σ_{j=1}^{K} ⟨h_j, u_i⟩^2 ≥ K − J;
(vi) λ_1 ≤ λ_2 ≤ ... ≤ λ_n.

We can use these facts to prove that, given M ≥ K,
$$\begin{aligned}
ncut(A,W) &= \sum_{i=1}^{|V|} \lambda_i \Big( \sum_{j=1}^{K} \langle h_j, u_i \rangle^2 \Big) = \lambda_2 \sum_{i=2}^{|V|} \Big( \sum_{j=1}^{K} \langle h_j, u_i \rangle^2 \Big) + \sum_{i=3}^{|V|} (\lambda_i - \lambda_2) \Big( \sum_{j=1}^{K} \langle h_j, u_i \rangle^2 \Big) \\
&\overset{\text{(iv)}}{=} \lambda_2 (K-1) + \sum_{i=3}^{|V|} (\lambda_i - \lambda_2) \Big( \sum_{j=1}^{K} \langle h_j, u_i \rangle^2 \Big) \\
&\overset{\text{(v)}}{\ge} \lambda_2 (K-1) + (\lambda_3 - \lambda_2)(K-2) + \sum_{i=4}^{|V|} (\lambda_i - \lambda_3) \Big( \sum_{j=1}^{K} \langle h_j, u_i \rangle^2 \Big) \\
&\;\ge\; \cdots \quad \text{(repeating the same step $K-3$ times)} \\
&\;\ge \sum_{i=2}^{K} \lambda_i + \sum_{i=K+1}^{|V|} (\lambda_i - \lambda_K) \Big( \sum_{j=1}^{K} \langle h_j, u_i \rangle^2 \Big) \\
&\overset{\text{(vi)}}{\ge} \sum_{i=2}^{K} \lambda_i + (\lambda_{M+1} - \lambda_K) \sum_{i=M+1}^{|V|} \Big( \sum_{j=1}^{K} \langle h_j, u_i \rangle^2 \Big) = \sum_{i=2}^{K} \lambda_i + (\lambda_{M+1} - \lambda_K)\, \|E_A\|_F^2 .
\end{aligned}$$
Since λ_1 = 0, this gives
$$ncut(A,W) - \sum_{i=1}^{K} \lambda_i \ge (\lambda_{M+1} - \lambda_K)\, \|E_A\|_F^2,$$
which is the extended gap bound of Theorem 2.

The proof of Theorem 1 follows from the extended gap bound with M = K.
A.2. A graph product example

With the objective of studying the behavior of partitioning spectral algorithms, [11] built an example that can deceive any spectral algorithm that uses a fixed number of eigenvectors. Their idea is to build a graph that is the crossproduct of a double binary tree graph (let it be G, with m vertices) and a path graph (let it be H, with n vertices; see Fig. 6). The graph crossproduct G × H is a graph on the vertex set {(u,v) | u ∈ G, v ∈ H} such that ((u,v),(u',v')) is in the edge set if and only if either u = u' and (v,v') ∈ E(H) or v = v' and (u,u') ∈ E(G). A pictorial example of graph crossproduct is shown in Fig. 6. For the sake of simplicity we consider paths with an even number of vertices. There are two interesting cuts of this graph, A and B in Fig. 6: the first cut is made of two halves of n/2 double binary trees and its ratio cut value is 4m/|V|; the second is along the edges that connect the roots of the double binary trees and has ratio cut value 4n/|V|. The two cuts are depicted in Fig. 6.

[Figure 6 about here]
Fig. 6. Graph product of the double binary tree graph G and the path graph H. A double binary tree graph consists of two complete binary trees with their roots connected by an edge. There are two interesting cuts for the crossproduct of graphs: (A) cuts along the path; (B) cuts along the trees.

Ref. [11] proved that if m ≥ 14 and n ≥ ⌈π k m^{1/2}⌉ the eigenvectors in positions 2, ..., k+1 are the ones of the path graph, in the sense that the corresponding vertices of the n copies of the double trees G share the same value in their eigenvector entries; this value is given by the corresponding entry in the eigenvector of the graph path H. The (k+2)-th eigenvector is, following the same reasoning as before, one of the double binary tree graph. This implies that, see [11] for details, we have bipartitions of value equal or greater than 4m/|V| in positions 2, ..., k+1 and a bipartition of value 4n/|V| in the (k+2)-th position; when n = ⌈π k m^{1/2}⌉ < m the cut given by the second eigenvector is at least quadratically worse than the optimal cut.³

³ It happens, for example, for m = 62, k = 2, n = 50.

The examples of [11] hold for the combinatorial Laplacian and the ratio cut. Here we show that by adding suitable weighted self-loops to the vertices we can use exactly the same results for the normalized Laplacian and the normalized cut. We shall see that, in general, given a graph G it is possible to build a weighted graph G', with self-loops, whose normalized Laplacian L_N(W') has the same eigenvectors as the combinatorial Laplacian L_C(W) of G. Consider W' as follows:
$$(W')_{i,j} = \begin{cases} \max D(G) - d(v_i,W) & \text{if } i = j, \\ 1 & \text{if } i \ne j \text{ and } i \sim_{G} j, \\ 0 & \text{otherwise}, \end{cases}$$
where max D(G) is the maximum degree of G. Now the normalized Laplacian L_N(W') is the following:
$$(L_N(W'))_{i,j} = \begin{cases} 1 - \dfrac{W'_{i,j}}{d(v_i,W')} & \text{if } i = j, \\[4pt] -\dfrac{W'_{i,j}}{\sqrt{d(v_j,W')\, d(v_i,W')}} & \text{if } i \ne j, \\[4pt] 0 & \text{otherwise}. \end{cases}$$
It is straightforward to see that L_C(W) = max D(G) · L_N(W'). So in particular they have the same eigenvectors, and the eigenvalues of L_N(W') are 1/max D(G) times the eigenvalues of L_C(W).

A.3. Explanation of Algorithm 1

Let G be a symmetric weighted graph with positive weights. Let the normalized Laplacian diagonalization be L_N(W) = U Λ U^T and let M ≥ K be a parameter. We can divide the columns of U into three matrices: U = [u_1, U_{2:M}, U_e], where U_{2:M} = [u_2, ..., u_M] ∈ R^{|V|×(M−1)} are the eigenvectors from the second to the M-th and U_e ∈ R^{|V|×(|V|−M)} are the remaining eigenvectors.
Let A* be a solution of problem (NCut). It is not difficult to check that the range of H_{A*} always contains u_1, so that it is possible to find a rotation Q ∈ R^{K×K} such that we can rewrite H_{A*} with u_1 as a column: H_{A*} Q = [u_1, A'_Q]. It is now useful to rewrite A'_Q ∈ R^{|V|×(K−1)}, whose columns are orthogonal to u_1, using the eigenvectors [U_{2:M}, U_e]:
$$A'_Q = [U_{2:M}\ U_e] \begin{bmatrix} R_Q \\ E_Q \end{bmatrix},$$
where R_Q ∈ R^{(M−1)×(K−1)} and E_Q ∈ R^{(|V|−M)×(K−1)}. We are interested in the case where the eigenvectors U_{2:M} reconstruct A'_Q well. This happens, for example, when the reconstruction ||R_Q||_F^2 is close to K − 1, a condition that can be checked, by Remark 2, using ε(M) of the extended gap bound:
$$\|R_Q\|_F^2 = \|H_{A^*}^T U_M\|_F^2 - 1 \ge K - \epsilon(M) - 1.$$
Using the thin SVD we can compute a basis P_Q for the columns of R_Q. We sample random matrices P_N ∈ R^{(M−1)×(K−1)}, with P_N^T P_N = I, and then compute the normalized cut value ncut(Rounding([u_1, U_{2:M} P_N], K)). This is motivated by the fact that if the sampled matrix P_N is close to P_Q then we are applying the rounding algorithm to a matrix close to H_{A*}. If δ ∈ [0,1] then we have, see Appendix A.4,
$$\|H_{A^*}^T [u_1, U_{2:M} P_N]\|_F^2 \ge (1 - \epsilon(M))\, \|P_N^T P_Q\|_F^2 + 1.$$
Step 6 of the algorithm arises from the following derivation:
$$\mathrm{Tr}(P_Q^T \Lambda_{2:M} P_Q\, I) = \mathrm{Tr}(P_Q^T \Lambda_{2:M} P_Q (I - S_Q^2 + S_Q^2)) = \mathrm{Tr}(P_Q^T \Lambda_{2:M} P_Q S_Q^2) + \mathrm{Tr}(P_Q^T \Lambda_{2:M} P_Q (I - S_Q^2)) \le NCut(G,K) + \lambda_M\, \mathrm{Tr}(I - S_Q^2) = NCut(G,K) + \lambda_M\, \epsilon(M).$$

A.4. Proof of Theorem 4

Given the hypotheses of Theorem 4, we shall prove it here. Let A ∈ P_{|V|,K} be a partition solving problem (NCut), represented as H_A ∈ R^{|V|×K}. At least one rotation Q ∈ R^{K×K}, Q Q^T = Q^T Q = I, exists such that
$$H_A Q = [u_1, A'_Q],$$
with A'_Q ∈ R^{|V|×(K−1)}. Considering the column subdivision of the matrix U = [u_1, U_{2:M}, U_e], with M ≥ K, we decompose A'_Q, as in Appendix A.3,
$$A'_Q = [U_{2:M},\ U_e] \begin{bmatrix} R_Q \\ E_Q \end{bmatrix},$$
where R_Q ∈ R^{(M−1)×(K−1)} and E_Q ∈ R^{(|V|−M)×(K−1)}. Consider the following facts:

(i) In general, if A, B ∈ R^{K×K} then σ_min(AB) ≥ σ_min(A)σ_min(B).
(ii) If ε(M) ≤ 1 then σ_min(R_Q) ≥ √(1 − ε(M)). This comes from rewriting H_A:
$$\begin{bmatrix} R_A \\ E_A \end{bmatrix} = [U_M, U_e]^T H_A = [U_M, U_e]^T [u_1, A'_Q]\, Q^T = \begin{bmatrix} 1 & 0 \\ 0 & \begin{bmatrix} R_Q \\ E_Q \end{bmatrix} \end{bmatrix} Q^T,$$
where the first row corresponds to u_1. Thus $\begin{bmatrix} 1 & 0 \\ 0 & R_Q \end{bmatrix} Q^T = R_A$. From this we may rewrite the Frobenius norm of R_A and relate it to that of R_Q:
$$\|R_A\|_F^2 = \left\| \begin{bmatrix} 1 & 0 \\ 0 & R_Q \end{bmatrix} Q^T \right\|_F^2 = \left\| \begin{bmatrix} 1 & 0 \\ 0 & R_Q \end{bmatrix} \right\|_F^2 = 1 + \|R_Q\|_F^2 = 1 + \sum_{i=1}^{K-1} \sigma_i^2(R_Q) \ge K - \epsilon(M).$$
Since for any i, σ_i^2(R_Q) ≤ 1, we have that σ_min(R_Q) ≥ √(1 − ε(M)).
(iii) The thin singular value decomposition of R_Q is, assuming that R_Q has rank K − 1, R_Q = P_Q S_Q V_Q^T, where P_Q ∈ R^{(M−1)×(K−1)}, S_Q ∈ R^{(K−1)×(K−1)}, V_Q ∈ R^{(K−1)×(K−1)}.
(iv) If P_N ∈ R^{(M−1)×(K−1)} is as in step 5 of Algorithm 1 then P(σ_min(P_Q^T P_N) ≥ √(1−δ)) ≥ ρ(√(1−δ), M−1, min(K−1, M−K−1)).

So, with probability ρ(√(1−δ), M−1, min(K−1, M−K−1)), we have
$$\begin{aligned}
\|H_A^T [u_1, U_{2:M} P_N]\|_F^2 &= \left\| Q \begin{bmatrix} u_1^T \\ {A'_Q}^T \end{bmatrix} [u_1, U_{2:M} P_N] \right\|_F^2 = \left\| Q \begin{bmatrix} 1 & 0 \\ 0 & R_Q^T P_N \end{bmatrix} \right\|_F^2 = 1 + \|R_Q^T P_N\|_F^2 \\
&\overset{\text{(iii)}}{=} 1 + \|V_Q S_Q^T P_Q^T P_N\|_F^2 = 1 + \sum_{i=1}^{K-1} \sigma_i^2(S_Q^T P_Q^T P_N) \quad (V_Q \text{ is a rotation}) \\
&\;\ge 1 + (K-1)\, \sigma_{\min}^2(S_Q^T P_Q^T P_N) \\
&\overset{\text{(i)}}{\ge} 1 + (K-1)\, \sigma_{\min}^2(S_Q)\, \sigma_{\min}^2(P_Q^T P_N) \\
&\overset{\text{(ii),(iv)}}{\ge} 1 + (K-1)(1 - \epsilon(M))(1 - \delta).
\end{aligned}$$
References [1] J. Shi, J. Malik, Normalized cuts and image segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (1997) 888–905. [2] F.R. Bach, M.I. Jordan, Learning spectral clustering, with application to speech separation, Journal of Machine Learning Research 7 (2006) 1963–2001. [3] M. Belkin, P. Niyogi, V. Sindhwani, Manifold regularization: a geometric framework for learning from labeled and unlabeled examples, Journal of Machine Learning Research 7 (2006) 2399–2434. [4] M. Meil˘a, J. Shi, Learning segmentation by random walks, in: NIPS, 2000, pp. 873–879. [5] A.Y. Ng, M.I. Jordan, Y. Weiss, On spectral clustering: analysis and an algorithm, in: NIPS, 2001, pp. 849–856. [6] C.H.Q. Ding, X. He, k-Means clustering via principal component analysis, in: ICML, 2004. [7] M. Szummer, T. Jaakkola, Partially labeled classification with Markov random walks, in: NIPS, 2001, pp. 945–952. [8] T. Joachims, Transductive learning via spectral graph partitioning, in: ICML, 2003, pp. 290–297. [9] B. Nadler, S. Lafon, R. Coifman, I.G. Kevrekidis. Diffusion maps—a probabilistic interpretation for spectral embedding and clustering algorithms, 2007. [10] F. Rendl, H. Wolkowicz, A projection technique for partitioning the nodes of a graph, 1995. [11] S. Guattery, G.L. Miller, On the performance of spectral graph partitioning methods, in: SODA, 1995, pp. 233–242. [12] C.J. Alpert, A.B. Kahng, S.-Z. Yao, Spectral partitioning with multiple eigenvectors, Discrete Applied Mathematics 90 (1–3) (1999) 3–26. [13] W.E. Donath, A.J. Hoffman, Lower bounds for the partitioning of graphs, IBM Journal of Research and Development 17 (1973) 420–425. [14] M. Fiedler, A property of eigenvectors of nonnegative symmetric matrices and its application to graph theory, Czechoslovak Mathematical Journal 25 (1975). [15] C.J. Alpert, A.B. Kahng, Recent directions in netlist partitioning: a survey, Integration: The VLSI Journal 19 (1995) 1–81. [16] K. Lang, Fixing two weaknesses of the spectral method, in: NIPS, 2005. [17] R. Andersen, K.J. Lang, An algorithm for improving graph partitions, in: SODA, 2008, pp. 651–660.
[18] J.A. Frankle, R.M. Karp, Circuit placement by eigenvector decomposition, in: ICCAD, 1986. [19] J.A. Frankle, Circuit placement methods using multiple eigenvectors and linear probe techniques. Ph.D. Thesis, EECS Department, University of California, Berkeley, 1987. [20] T. Xiang, S. Gong, Spectral clustering with eigenvector selection, Pattern Recognition 41 (3) (2008) 1012–1029. [21] F. Zhao, L. Jiao, H. Liu, X. Gao, M. Gong, Spectral clustering with eigenvector selection based on entropy ranking, Neurocomputing 73 (10–12) (2010) 1704–1717. [22] N. Rebagliati, A. Verri, A randomized algorithm for spectral clustering, in: ESANN, 2010, pp. 381–386. [23] L. Zelnik-Manor, P. Perona, Self-tuning spectral clustering, in: NIPS, 2004. [24] F.R.K. Chung, Spectral graph theory, in: CBMS Regional Conference Series in Mathematics, vol. 92, AMS, February 1997. [25] Z. Wu, R. Leahy, An optimal graph theoretic approach to data clustering: theory and its application to image segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 15 (11) (1993) 1101–1113. [26] U. von Luxburg, A tutorial on spectral clustering, Statistics and Computing 17 (4) (2007) 395–416. [27] M. Meil˘a, The uniqueness of a good optimum for k-means, in: ICML, 2006, pp. 625–632. [28] K. Fan, Maximum properties and inequalities for the eigenvalues of completely continuous operators, Proceedings of the National Academy of Science 37 (November) (1951) 760–766. [29] M. Meila, S. Shortreed, L. Xu, Regularized spectral learning, Technical report, Proceedings of the Artificial Intelligence and Statistics Workshop (AISTATS 05), 2005. [30] A. Singer, From graph to manifold Laplacian: the convergence rate, 2006. [31] I.S. Dhillon, Y. Guan, B. Kulis, Kernel k-means: spectral clustering and normalized cuts, in: KDD, 2004, pp. 551–556. [32] N. Rebagliati, Spectral techniques for clustering, Ph.D. Thesis, 2010. [33] G.H. Golub, C.F. Van Loan, Matrix Computations, Johns Hopkins Studies in Mathematical Sciences, The Johns Hopkins University Press, October 1996. [34] P.-A. Absil, A. Edelman, P. Koev, On the largest principal angle between random subspaces, Linear Algebra Application 414 (1) (2006) 288–294. [35] P. Koev, A. Edelman, The efficient evaluation of the hypergeometric function of a matrix argument, Mathematics of Computation 75 (2006) 833–846. [36] M. Meil˘a, Comparing clusterings by the variation of information, in: COLT, 2003, pp. 173–187. [37] C.M. Fiduccia, R.M. Mattheyses, A linear-time heuristic for improving network partitions, in: 25 Years of DAC: Papers on Twenty-five Years of Electronic Design Automation, ACM, New York, NY, USA, 1988, pp. 241–247. [38] A.J. Soper, C. Walshaw, M. Cross, A combined evolutionary search and multilevel approach to graph partitioning, Journal of Global Optimization, 2004. [39] M.R. Garey, D.S. Johnson, L. Stockmeyer, Some simplified np-complete problems, in: STOC ’74: Proceedings of the Sixth Annual ACM Symposium on Theory of Computing, ACM, New York, NY, USA, 1974, pp. 47–63. [40] L.W. Hagen, A.B. Kahng, New spectral methods for ratio cut partitioning and clustering, 1992.
Nicola Rebagliati received the Laurea degree in computer science and the Ph.D. degree in computer science from the University of Genova, Italy, in 2006 and 2010, respectively. He is currently a postdoctorate with the Computer and Information Science Department, University of Ca’ Foscari, where he is involved in the SIMBAD FET project. His main interests in research are related to spectral graph theory and foundations for clustering.
Alessandro Verri received the Laurea and Ph.D. degrees in physics from the University of Genova, Italy, in 1984 and 1989, respectively. Since 1989, he has been with the University of Genova, where he is now a Professor with the Department of Computer and Information Science. He has been a Visiting Scientist and Professor at the Massachusetts Institute of Technology, Cambridge; INRIA, Rennes, France; ICSI, Berkeley, CA; and Heriot-Watt University, Edinburgh, UK. He has published more than 70 papers on stereopsis, motion analysis in natural- and machine-vision systems, shape representation and recognition, pattern recognition, 3-D object recognition, and learning theory. He is also the coauthor of a textbook on computer vision with Dr. E. Trucco. Currently, he is interested in the mathematical and computational aspects of learning theory with applications in computer vision and computational biology.