A spectral clustering-based framework for detecting community structures in complex networks


Applied Mathematics Letters 22 (2009) 1479–1482


Applied Mathematics Letters journal homepage: www.elsevier.com/locate/aml

Jeffrey Q. Jiang a,b, Andreas W.M. Dress b,c,∗, Genke Yang a

a Department of Automation, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai 200240, China

b CAS-MPG Partner Institute of Computational Biology, Shanghai Institutes of Biological Science, China Academy of Sciences, 320 Yueyang Road, Shanghai 200031, China

c Max-Planck-Institut fuer Mathematik in den Naturwissenschaften, Inselstrasse 22-26, Leipzig D-04103, Germany

Article info

Article history: Received 23 February 2009; Accepted 23 February 2009

Keywords: Community structure; Spectral clustering; Fuzzy clustering; Complex networks

Abstract

Exploring recent developments in spectral clustering, we discovered that relaxing a spectral reformulation of Newman’s Q-measure (a measure that may guide the search for, and help to evaluate the fit of, community structures in networks) yields a new framework for detecting fuzzy communities and identifying so-called unstable nodes. In this note, we present and illustrate this approach, which we expect to further enhance our understanding of the intrinsic structure of networks and of network-based clustering procedures. We applied a variation of the fuzzy k-means algorithm, an instance of our framework, to two social networks. The computational results illustrate its potential.

© 2009 Elsevier Ltd. All rights reserved.

1. Introduction

Networks have become a popular tool for describing complex real-world systems [1–3]. It has been observed that, adopting a rather coarse-grained top-down point of view, methods for associating a suitable community structure with a given network can help to elucidate a network’s intrinsic structure and to reveal its “global” or “overall” organization in terms of its “modules” [4], thus supporting the analysis of social, technological, biological, and many other networks. So far, most approaches [5] proposed towards this end try to iteratively improve the following Q-measure $Q(\Pi)$ [6,7] of a partition $\Pi$ of the node set $V$ of a simple edge-weighted graph $G$, given in terms of a symmetric weight matrix $W = (w_{uv})_{u,v \in V} \in \mathbb{R}_{\ge 0}^{V \times V}$, into, say, $m$ disjoint subsets $V_1, \ldots, V_m$:

$$Q(\Pi) := \sum_{j=1}^{m} \left[ \frac{F(V_j, V_j)}{F(V, V)} - \left( \frac{F(V_j, V)}{F(V, V)} \right)^{2} \right] \qquad (1)$$

where $F(V', V'')$ is defined, for any two subsets $V', V''$ of $V$, by $F(V', V'') := \sum_{u \in V',\, v \in V''} w_{uv}$. It was illustrated [5–7] that a high Q-value indicates that the partition $\Pi$ represents a “good” community structure for $G$ and, so, much work has been devoted in recent years to designing methods for finding partitions with high Q-values. In particular, Donetti and Muñoz presented an algorithm in [8] that combines spectral methods with clustering techniques. Their and related approaches (e.g. [9,10]) suggest that considering the map $\Phi_{e_1,\ldots,e_m} : V \to \mathbb{R}^m : v \mapsto (e_1(v), \ldots, e_m(v))$ associated with the eigenvectors $e_1, \ldots, e_m \in \mathbb{R}^V$ belonging to, say, the $m$ largest eigenvalues of some symmetric $V \times V$-matrix derived from $W$ and

∗ Corresponding author at: CAS-MPG Partner Institute of Computational Biology, Shanghai Institutes of Biological Science, China Academy of Sciences, 320 Yueyang Road, Shanghai 200031, China. E-mail addresses: [email protected] (J.Q. Jiang), [email protected] (A.W.M. Dress). 0893-9659/$ – see front matter © 2009 Elsevier Ltd. All rights reserved. doi:10.1016/j.aml.2009.02.005


clustering the elements in $V$ in terms of their images in $\mathbb{R}^m$ may be a promising strategy. Exploring this idea, we noticed that Q-based clustering can be understood as a special instance within the broader family of spectral clustering methods [11,12]. More specifically, we will show here that the problem of maximizing the modularity measure $Q$ is closely related to an eigenvector problem involving a matrix named the Q-Laplacian, thus linking Q-based clustering procedures to recent developments in spectral clustering.

2. Spectral methods can maximize modularity

Indeed, putting $\mathrm{vol}_G := F(V, V)$, we have

$$\mathrm{vol}_G^2 \, Q(\Pi) = \sum_{j=1}^{m} \left( \mathrm{vol}_G \, F(V_j, V_j) - F(V_j, V)^2 \right).$$

Thus, re-interpreting $\Pi$ as a collection of indicator maps (or vectors)

$$p_j = p_j^{\Pi} : V \to \mathbb{R} : v \mapsto p_{jv} := \begin{cases} 1 & \text{if } v \in V_j, \\ 0 & \text{otherwise,} \end{cases}$$

$Q(\Pi)$ coincides, up to the constant factor $\mathrm{vol}_G^2$, with

$$\sum_{j=1}^{m} \left( \mathrm{vol}_G \sum_{u,v \in V} w_{uv}\, p_{ju}\, p_{jv} - \Big( \sum_{u,v \in V} w_{uv}\, p_{jv} \Big)^{2} \right) = \sum_{j=1}^{m} \left( \mathrm{vol}_G \, p_j^T W p_j - \Big( \sum_{v \in V} d_v\, p_{jv} \Big)^{2} \right) = \sum_{j=1}^{m} \left( \mathrm{vol}_G \, p_j^T W p_j - (d^T p_j)^2 \right)$$

where $d = d_G \in \mathbb{R}^V$ denotes the degree vector $(d_v)_{v \in V} := \big( \sum_{u \in V} w_{vu} \big)_{v \in V}$. Thus,

$$\mathrm{vol}_G^2 \, Q(\Pi) = \sum_{j=1}^{m} \left( \mathrm{vol}_G \, p_j^T W p_j - p_j^T d d^T p_j \right) = \sum_{j=1}^{m} p_j^T \left( \overline{W} - \overline{D} \right) p_j$$

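where $\overline{W} := \mathrm{vol}_G\, W$ and $\overline{D} := d d^T$. In consequence, choosing an orthonormal system $e_1 = (e_{1v})_{v \in V}, \ldots, e_n = (e_{nv})_{v \in V} \in \mathbb{R}^V$ of $n := |V|$ eigenvectors of the (symmetric!) Q-Laplacian $L_Q = L_Q(G) := \overline{W} - \overline{D}$ of $G$ with corresponding (necessarily real!) eigenvalues $\lambda_1 \ge \cdots \ge \lambda_n$, so that $p = \sum_{i=1}^{n} \langle p | e_i \rangle\, e_i$ and $p^T (\overline{W} - \overline{D})\, p = p^T L_Q\, p = \sum_{i=1}^{n} \langle p | e_i \rangle^2 \lambda_i$ holds for every vector $p \in \mathbb{R}^V$ (where $\langle \cdot | \cdot \rangle$ denotes the standard inner product in $\mathbb{R}^V$), we see that

$$\mathrm{vol}_G^2 \, Q(\Pi) = \sum_{j=1}^{m} p_j^T \left( \overline{W} - \overline{D} \right) p_j = \sum_{j=1}^{m} \sum_{i=1}^{n} \langle p_j | e_i \rangle^2 \lambda_i = \sum_{i=1}^{n} \Big( \sum_{j=1}^{m} \langle p_j | e_i \rangle^2 \Big) \lambda_i$$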

holds for every partition $\Pi$. The problem of maximizing $Q$ is thus equivalent to the problem of maximizing the term $Q(p_1, \ldots, p_m) := \sum_{i=1}^{n} \big( \sum_{j=1}^{m} \langle p_j | e_i \rangle^2 \big) \lambda_i$ subject to $p_j \in \{0,1\}^V$ and $\sum_{j=1}^{m} p_j = \mathbf{1}$, where $\mathbf{1} = \mathbf{1}_V$ denotes the all-one vector in $\mathbb{R}^V$. Finding such a partition $\Pi$ is, of course, NP-complete [11]. While one could use quadratic integer programming, we propose here to derive sufficiently good approximations by relaxing the constraint $p_j \in \{0,1\}^V$, requiring only that $p_j \in [0,1]^V$ holds for all $j \in \{1, \ldots, m\}$ while keeping the condition $\sum_{j=1}^{m} p_j = \mathbf{1}$. This has the additional advantage that it allows us to deviate from the standard approaches that, in a perfectly rigid manner, aim to assign each single node to exactly one community while, in real situations, some nodes may actually belong to more than one “network module”. Such nodes that do have strong connections to two or more such modules are sometimes referred to as “unstable” nodes [13,14]. Clearly, the fuzzy partitions introduced above allow us to adopt a “soft” assignment of the nodes to the $m$ clusters.

However, even this relaxed problem is difficult to solve exactly (though, again, one could use quadratic programming). Yet, the apparently closely related problem obtained by replacing the two conditions $p_1, \ldots, p_m \in [0,1]^V$ and $\sum_{j=1}^{m} p_j = \mathbf{1}$ by the single requirement that the vectors $p_1, \ldots, p_m$ form a system of orthonormal vectors is easy to solve. Indeed, splitting each eigenvector $e_i$ ($i = 1, \ldots, n$) into its two components $e'_i := \sum_{j=1}^{m} \langle p_j | e_i \rangle\, p_j$ and $e''_i := e_i - e'_i$ relative to the orthogonal decomposition of $\mathbb{R}^V$ into the orthogonal sum of the linear span of the orthonormal vectors $p_1, \ldots, p_m$ and its orthogonal complement, we see that $\sum_{j=1}^{m} \langle p_j | e_i \rangle^2$ coincides with the square $\|e'_i\|^2$ of the (euclidean) length of the vector $e'_i$ and that, therefore, $\sum_{i=1}^{n} \big( \sum_{j=1}^{m} \langle p_j | e_i \rangle^2 \big) \lambda_i = \sum_{i=1}^{n} \|e'_i\|^2 \lambda_i$ always holds. Thus, noting that also $0 \le \|e'_i\|^2 \le 1$ and

$$\sum_{i=1}^{n} \|e'_i\|^2 = \sum_{i=1}^{n} \sum_{j=1}^{m} \langle p_j | e_i \rangle^2 = \sum_{j=1}^{m} \sum_{i=1}^{n} \langle p_j | e_i \rangle^2 = \sum_{j=1}^{m} \Big\langle \sum_{i=1}^{n} \langle p_j | e_i \rangle e_i \,\Big|\, \sum_{i=1}^{n} \langle p_j | e_i \rangle e_i \Big\rangle = \sum_{j=1}^{m} \langle p_j | p_j \rangle = m$$

we see that $Q(p_1, \ldots, p_m) = \sum_{i=1}^{n} \|e'_i\|^2 \lambda_i \le \sum_{i=1}^{m} \lambda_i$ always holds, and that equality holds if and only if the $\mathbb{R}$-linear span of the vectors $p_1, \ldots, p_m$ coincides with the $\mathbb{R}$-linear span of some appropriately chosen eigenvectors $e_1, \ldots, e_m$ of $L_Q(G)$ with eigenvalues $\lambda_1, \ldots, \lambda_m$ (a subspace of $\mathbb{R}^V$ that is uniquely determined by $L_Q(G)$ if and only if $\lambda_m > \lambda_{m+1}$ holds).
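The identity between the Q-measure of Eq. (1) and the quadratic form in the Q-Laplacian can be checked numerically. The following is a minimal sketch, not the authors' code: the function names and the toy graph (two 4-cliques joined by a single edge) are assumptions made for illustration only.

```python
import numpy as np

def q_measure(W, labels):
    """Q-measure of Eq. (1) for a crisp partition given as one label per node."""
    vol = W.sum()                                   # vol_G = F(V, V)
    q = 0.0
    for j in np.unique(labels):
        inside = labels == j
        f_jj = W[np.ix_(inside, inside)].sum()      # F(V_j, V_j)
        f_jv = W[inside, :].sum()                   # F(V_j, V)
        q += f_jj / vol - (f_jv / vol) ** 2
    return q

def q_laplacian(W):
    """Q-Laplacian L_Q = vol_G * W - d d^T together with vol_G."""
    d = W.sum(axis=1)                               # degree vector d
    vol = d.sum()
    return vol * W - np.outer(d, d), vol

def q_from_vectors(W, P):
    """Q computed as (1 / vol_G^2) * sum_j p_j^T L_Q p_j; the p_j are the columns of P."""
    L, vol = q_laplacian(W)
    return float(np.trace(P.T @ L @ P)) / vol ** 2

# Hypothetical toy graph: two 4-cliques joined by the single edge {3, 4}.
W = np.zeros((8, 8))
W[:4, :4] = 1.0
W[4:, 4:] = 1.0
np.fill_diagonal(W, 0.0)
W[3, 4] = W[4, 3] = 1.0

labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
P = np.eye(2)[labels]                               # 8 x 2 matrix of indicator vectors

print("Q from Eq. (1):      ", q_measure(W, labels))
print("Q from Q-Laplacian:  ", q_from_vectors(W, P))
```

Because `q_from_vectors` only uses the quadratic form, it also accepts fuzzy membership vectors with entries in [0, 1], which is the relaxation discussed in this section.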


Fig. 1. The fuzzy community structure and the unstable nodes of the karate-club network as detected by our method. (a) The fuzzy community structure of the karate-club network. (b) The Q -measure varies with the number of the groups m. (c) The nodes’ entropies. The three distinct levels indicate different degrees of instability of these nodes.
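The bound obtained at the end of Section 2 for orthonormal systems can also be illustrated numerically. The sketch below assumes a dense symmetric weight matrix and uses `numpy.linalg.eigh`; the helper names and the toy graph are hypothetical, and for large sparse networks a different eigensolver would be needed. It compares the value of $\sum_j p_j^T L_Q p_j$ for the top-$m$ eigenvectors, for the normalised clique indicator vectors, and for a random orthonormal system, and it re-scales the rows of the eigenvector matrix to unit length.

```python
import numpy as np

def q_laplacian(W):
    """Q-Laplacian L_Q = vol_G * W - d d^T of a symmetric weight matrix W."""
    d = W.sum(axis=1)
    return d.sum() * W - np.outer(d, d)

def top_eigenvectors(L, m):
    """The m largest eigenvalues of L and an orthonormal system of eigenvectors."""
    lam, E = np.linalg.eigh(L)            # eigh returns eigenvalues in ascending order
    order = np.argsort(lam)[::-1][:m]     # indices of the m largest eigenvalues
    return lam[order], E[:, order]

def relaxed_value(L, P):
    """sum_j p_j^T L p_j, where the p_j are the columns of P."""
    return float(np.trace(P.T @ L @ P))

def unit_rows(E):
    """Re-scale every row to unit l2-length (all-zero rows are left unchanged)."""
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    return E / np.where(norms > 0, norms, 1.0)

# Hypothetical toy graph: two 4-cliques joined by the single edge {3, 4}.
W = np.zeros((8, 8))
W[:4, :4] = 1.0
W[4:, 4:] = 1.0
np.fill_diagonal(W, 0.0)
W[3, 4] = W[4, 3] = 1.0

m = 2
L = q_laplacian(W)
lam_top, E_m = top_eigenvectors(L, m)
bound = lam_top.sum()                     # lambda_1 + ... + lambda_m

rng = np.random.default_rng(0)
P_random, _ = np.linalg.qr(rng.standard_normal((8, m)))   # random orthonormal system
P_cliques = np.zeros((8, m))
P_cliques[:4, 0] = 0.5                    # normalised indicator vector of the first clique
P_cliques[4:, 1] = 0.5                    # normalised indicator vector of the second clique

print("bound            :", bound)
print("top eigenvectors :", relaxed_value(L, E_m))
print("clique indicators:", relaxed_value(L, P_cliques))
print("random system    :", relaxed_value(L, P_random))

R = unit_rows(E_m)                        # one unit-length image in R^m per node
```

On this toy graph the normalised clique indicators come close to the bound, which is the behaviour one would hope for when the relaxed solution is used to guide the search for a partition.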

3. The resulting strategy for community detection

This suggests the following strategy for identifying good (fuzzy or non-fuzzy) partitions for $G$:

Step 1 Spectral mapping: Compute, for some sufficiently large integer $M$, the $M$ largest eigenvalues $\lambda_1, \ldots, \lambda_M$ of $L_Q(G)$ and the corresponding eigenvectors $e_1, e_2, \ldots, e_M$; for each $m$, $2 \le m \le M$, form the $V \times m$-matrix $E_m$ whose columns are the first $m$ eigenvectors $e_1, e_2, \ldots, e_m$, and re-scale the ($V$-indexed) rows $(e_{1v}, \ldots, e_{mv})$ ($v \in V$) of $E_m$ so that, relative to the $\ell_2$-norm, they all have unit length. Alternatively, one may represent the all-one vector $\mathbf{1}_V$ as a linear combination $\mathbf{1}_V = \sum_{i=1}^{n} \langle \mathbf{1}_V | e_i \rangle\, e_i = \sum_{i=1}^{n} \big( \sum_{v \in V} e_{iv} \big)\, e_i$ of the eigenvectors $e_1, \ldots, e_n$ and then consider the ($V$-indexed) rows of the $V \times m$-matrix $E'_m$ whose columns are the scalar multiples $\langle \mathbf{1}_V | e_1 \rangle e_1, \langle \mathbf{1}_V | e_2 \rangle e_2, \ldots, \langle \mathbf{1}_V | e_m \rangle e_m$ of the first $m$ eigenvectors, either as they are or again re-scaling the rows so that they all have unit length relative to the $\ell_2$-norm.

Step 2 Fuzzy or non-fuzzy clustering: For each $m$ as above, cluster the elements in $V$ by clustering the corresponding $m$-dimensional row vectors $r_v = (r_{jv})_{j=1,\ldots,m}$ of $E_m$ into $m$ clusters, using your favourite fuzzy or non-fuzzy clustering method.

Step 3 Maximizing modularity: Pick that $m$ and the corresponding fuzzy or non-fuzzy partition that maximize $Q$.

In the examples below, we used a variant of the classical “fuzzy k-means algorithm” [15] (with $k := m$), using a minimization procedure that works by iteratively updating the membership degree $p_{jv}$ of the node $v$ to the fuzzy cluster $V_j$ and its $m$-dimensional center $z_j$, as well as the corresponding objective function:

$$J_{s,\lambda}(p_1, \ldots, p_m, z) := \sum_{v \in V} \sum_{j=1}^{m} p_{jv}^{s}\, \| r_v - z_j \|^2 + \lambda \sum_{v \in V} \sum_{j=1}^{m} p_{jv} \log_2 p_{jv}$$

where $s \in [1, \infty)$ is a weight exponent controlling the degree of fuzzification, $p_{jv}$ is the membership degree of the node $v$ in the cluster $V_j$ that is subject to the conditions $p_{jv} \in [0,1]$ and $\sum_{j=1}^{m} p_{jv} = 1$, $z_j$ is the $m$-dimensional center of the cluster $V_j$, $\lambda$ is a non-negative constant controlling the degree of homogeneity of the vectors $p_j$ (viewed as probability distributions), and $\| \cdot \|$ is the $\ell_2$ norm. Note that the first term in the above objective function is the standard term in the fuzzy k-means algorithm [15] while the second term is a penalty term enhancing the stability of the various nodes (see [13] for the definition of a node’s stability).

4. Numerical examples

We investigate the performance of our proposed algorithm by applying it to two famous social networks: Zachary’s karate-club network [16] and the scientific collaboration network [4].

4.1. Zachary’s karate-club network

The first real-life example that we have looked at is the famous “karate-club” network that was first studied in [4] and is widely used as a prototype for testing community-detection methods [5–7,14]. Its 34 nodes represent 34 members; its 78 edges represent friendship relationships. After a disagreement developed between the administrator and the instructor, the club ultimately split into two [16]. The question is whether one can predict the two or multiple communities and, in particular, find the unstable nodes by using a simple weighted version of the network and our algorithm. Fig. 1(a) shows the community structure found by our method. In order to enhance our understanding, we also investigated the so-called unstable nodes [13,14]. To this end, we propose to measure the stability of a node in terms of its entropy

$$\Xi_i = -\{ p_{ij^*} \log_2 p_{ij^*} + (1 - p_{ij^*}) \log_2 (1 - p_{ij^*}) \} \qquad (2)$$
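The entropy of Eq. (2) and the objective $J_{s,\lambda}$ of Section 3 can be tried out on toy data. The text does not spell out the update rules of the fuzzy k-means variant, so the sketch below makes a simplifying assumption: it fixes $s = 1$, for which both alternating updates have closed forms (memberships proportional to $2^{-\|r_v - z_j\|^2 / \lambda}$, centers equal to membership-weighted means). It also takes $j^*$ to be the cluster with the largest membership degree. The function names, the farthest-point initialisation and the seven toy points are hypothetical and are not taken from the paper.

```python
import numpy as np

def plog2p(x):
    """x * log2(x) with the convention 0 * log2(0) = 0."""
    return np.where(x > 0, x * np.log2(np.where(x > 0, x, 1.0)), 0.0)

def objective(R, P, Z, lam):
    """J_{1,lam}: P[v, j] is the membership p_jv, the rows of Z are the centers z_j."""
    dist2 = ((R[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return float((P * dist2).sum() + lam * plog2p(P).sum())

def fuzzy_kmeans_s1(R, m, lam=0.1, iters=50):
    """Alternating minimisation of J_{1,lam} (requires lam > 0)."""
    # Assumption: deterministic farthest-point initialisation of the centers.
    centers = [R[0]]
    for _ in range(1, m):
        gap = np.min([((R - z) ** 2).sum(axis=1) for z in centers], axis=0)
        centers.append(R[np.argmax(gap)])
    Z = np.array(centers, dtype=float)
    history = []
    for _ in range(iters):
        dist2 = ((R[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
        # Membership update for s = 1: p_jv proportional to 2^(-dist2 / lam).
        logits = -dist2 / lam
        P = np.exp2(logits - logits.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)
        # Center update: membership-weighted mean of the row vectors.
        Z = (P.T @ R) / P.sum(axis=0)[:, None]
        history.append(objective(R, P, Z, lam))
    return P, Z, history

def node_entropy(P):
    """Entropy of Eq. (2), computed from each node's largest membership degree."""
    p_star = P.max(axis=1)
    return -(plog2p(p_star) + plog2p(1.0 - p_star))

# Hypothetical row vectors: two tight groups plus one point halfway between them.
R = np.array([[1.0, 0.0], [0.9, 0.1], [1.0, 0.1],
              [0.0, 1.0], [0.1, 0.9], [0.1, 1.0],
              [0.5, 0.5]])
P, Z, history = fuzzy_kmeans_s1(R, m=2, lam=0.1)
xi = node_entropy(P)
print(np.round(P, 3))
print(np.round(xi, 3))
```

In this toy run the halfway point keeps nearly equal membership in both clusters and therefore has entropy close to 1, while the points inside the two groups have entropy close to 0. For $s > 1$ combined with $\lambda > 0$ the membership update no longer has this closed form and would need a numerical solver.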

[Fig. 2, panels (a) to (d): (a) network drawing with groups labelled Mathematic Ecology, Structure of RNA, Agent-based Models and Statistical Physics; (b) Q-measure versus community number m; (c), (d) entropy ε versus nodes.]

Fig. 2. The scientific collaboration network. (a) The community structure: different colors (shapes) indicate the 6 / 3 communities that we found. (b) The Q -measure as a function of the number of groups. If m = 6 and m = 3, Q attains its largest and its second-largest value, respectively. (c/d) The nodes’ entropies in the case m = 6 / m = 3.

In Eq. (2), $j^* := \arg\max_j p_{ij}$ denotes the cluster in which node $i$ has its largest membership degree $p_{ij}$ (written $p_{jv}$, with $v = i$, in Section 3). Nodes with larger entropy are less stable. The entropy of each node is given in Fig. 1(c). Our results are in conformity with Zachary’s report [16] (see the supplementary materials for more discussions).

4.2. The scientific collaboration network

The collaboration network of the scientists at the Santa Fe Institute was first introduced by Girvan and Newman [4] and further examined in [6,7,14]. The 118 vertices in this network represent scientists at this Institute during any part of the calendar years 1999 or 2000 and their collaborators. An edge is drawn between any pair of scientists if they coauthored one or more articles during that period. As Fig. 2(b) demonstrates, the Q-measure achieves its maximum for $m = 6$ communities, which is consistent with [4] but distinct from [14]. Therefore, we also investigated the classification when the Q-measure achieves its second-largest value and the network splits into three groups. Fig. 2(a) shows the community structure detected by our algorithm. The three different shapes of nodes indicate the resulting three groups while different colors of nodes describe the six communities. Our method could not only reveal the overall organization of the network, but also uncover unstable nodes which may belong to more than one community and constitute the soft boundaries of the communities. Comparing our results with those of Girvan and Newman [4], some interesting points came up that are discussed in the supplementary materials.

Acknowledgements

Jeffrey Jiang is grateful to Peter Florian for useful discussions and comments, and to Mu Li for help with programming. Andreas Dress thanks the BMBF/Germany, the Chinese Academy of Sciences, the German Max Planck Gesellschaft, the Institute of Advanced Study at the University of Warwick, and the Allan Wilson Centre in NZ for their support.

Appendix.
Supplementary data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.aml.2009.02.005.

References

[1] S.H. Strogatz, Exploring complex networks, Nature 410 (2001) 268–276.
[2] M.E.J. Newman, The structure and function of complex networks, SIAM Rev. 45 (2) (2003) 167–256.
[3] A.L. Barabási, The New Science of Networks, Perseus Publishing, Cambridge, MA, 2000.
[4] M. Girvan, M.E.J. Newman, Community structure in social and biological networks, Proc. Nat. Acad. Sci. USA 99 (2002) 7821–7826.
[5] M.E.J. Newman, Detecting community structure in networks, Eur. Phys. J. B 38 (2) (2004) 321–330.
[6] M.E.J. Newman, M. Girvan, Finding and evaluating community structure in networks, Phys. Rev. E 69 (2004) 026113.
[7] M.E.J. Newman, Fast algorithm for detecting community structure in networks, Phys. Rev. E 69 (2) (2004) 066133.
[8] L. Donetti, M.A. Muñoz, Detecting network communities: A new systematic and efficient algorithm, J. Stat. Mech. (2004) P10012.
[9] A. Pothen, H. Simon, K.-P. Liou, Partitioning sparse matrices with eigenvectors of graphs, SIAM J. Matrix Anal. Appl. 11 (1990) 430–452.
[10] R. Kannan, S. Vempala, A. Vetta, On clusterings: Good, bad and spectral, J. ACM 51 (3) (2004) 497–515.
[11] A.Y. Ng, M.I. Jordan, Y. Weiss, On spectral clustering: Analysis and an algorithm, in: Advances in Neural Information Processing Systems, 14, 2001.
[12] D. Verma, M. Meilă, A comparison of spectral clustering algorithms, Technical Report, UW CSE Technical Report 03-05-01, 2003.
[13] D. Gfeller, J. Chappelier, P.D.L. Rios, Finding instabilities in the community structure of complex networks, Phys. Rev. E 72 (2005) 056135.
[14] S. Zhang, R. Wang, X. Zhang, Uncovering fuzzy community structure in complex networks, Phys. Rev. E 76 (2007) 046103.
[15] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.
[16] W.W. Zachary, An information flow model for conflict and fission in small groups, J. Anthropol. Res. 33 (1977) 452–473.