Physica A 391 (2012) 1797–1810
Detecting community structure using biased random merging

Xu Liu a, Jeffrey Yi-Lin Forrest a,b, Qiang Luo c, Dong-Yun Yi a,*

a Department of Mathematics and Systems Science, College of Science, National University of Defense Technology, Changsha 410073, China
b Department of Mathematics, Slippery Rock University, Slippery Rock, PA 16057, USA
c Department of Management, College of Information Systems and Management, National University of Defense Technology, Changsha 410073, China
Article history: Received 12 July 2011. Received in revised form 12 September 2011. Available online 28 October 2011.

Keywords: Community detection; Agglomerative greedy clustering; Data structure; Complex networks
Abstract: Decomposing a network into small modules or communities is beneficial for understanding the structure and dynamics of the network. One of the most prominent approaches is to repeatedly join communities together in pairs with the greatest increase in modularity, so that a dendrogram recording the order of joins is obtained. The community structure is then acquired by cutting the dendrogram at the level corresponding to the maximum modularity. However, there tend to be multiple pairs of communities that share the maximum modularity increment, and the greedy agglomerative procedure can merge only one of them. Although the modularity function typically admits many high-scoring solutions, the greedy strategy may fail to reach any of them. In this paper we propose an enhanced data structure that enables diverse choices of merging operations in the community finding procedure. This data structure is a max-heap equipped with an extra head array that stores the largest modularity increments, whose corresponding community pairs are the candidates for the next merge. By randomly sampling elements from this head array, additional diverse community structures can be extracted efficiently. Because the head array hosts the community pairs corresponding to the most significant increments in modularity, the community structures obtained from the sampling exhibit high modularity scores that are, on average, even greater than what the CNM algorithm produces. Our method is tested on both real-world and computer-generated networks. © 2011 Elsevier B.V. All rights reserved.
1. Introduction

Networks are powerful tools for modeling the structure, dynamics, robustness, and evolution of complex systems [1–3] arising in biology, sociology, engineering, information science, etc.; see Ref. [4] for a comprehensive review. When networks are used to model real-world systems, they reveal communities, a mesoscopic feature of these systems [5], with dense internal and sparse external connectivity. Community detection is the process of placing nodes into groups such that the nodes within a group are densely connected to each other and sparsely connected to the nodes outside the group [6]. Understanding the topology, especially the community structure, of a network is the first step in characterizing the functions of individual nodes toward the eventual comprehension of the network's dynamics. The community structure often reflects a network's functions, such as cycles or pathways in metabolic networks or collections of pages on a single topic on the web [7]. A recent work by Karsai et al. [8] suggests that the community structure is a key factor for understanding the dynamic processes and other features of networks, and that the power-law degree distribution and the small-world property may not be adequate to reveal the mechanisms of networks' dynamic processes.
* Corresponding author. Tel.: +86 731 84573261; fax: +86 731 84573265.
E-mail addresses: [email protected] (X. Liu), [email protected], [email protected] (D.-Y. Yi).
0378-4371/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.physa.2011.09.028
It is well known that most real-world networks display community structures [2,6,9]. However, in the literature there is no generally accepted definition of what a community or module actually is. Although modularity is a widely used quality function for community detection, its behavior and accuracy are not well understood in practical contexts. As pointed out recently by Good et al. [10], the modularity function exhibits extreme degeneracies: its landscape is characterized by an exponential number of distinct partitions with close modularity values, and it typically lacks a clear global maximum. High-modularity partitions contain large structural inhomogeneity. That is, one cannot be content with any single partition produced by a completely deterministic algorithm. Deterministic algorithms may be prone to error when there are fluctuations in the network's structure, and the many alternative community structures cannot be found by them. We will see later in this paper that, by using random sampling strategies, diverse community structures from the high-scoring region can be extracted. Community detection is similar to the well studied graph partitioning problem [11], where the network is partitioned into c (a number given in advance) groups while minimizing the number of edges between the groups. However, the number of communities in a network and their sizes are not known beforehand; they must be established by the community detection algorithm. The number of ways of dividing n vertexes into c nonempty communities is a Stirling number of the second kind [12], and global modularity maximization is NP-complete [13]; exhaustive enumeration is therefore infeasible for most networks.
For this reason, a vast number of heuristic methods for community detection have been developed [9], including the repeated removal of high-betweenness edges [6], greedy optimization [14], spectral clustering [15], a local search strategy that extracts communities one by one based on connecting degree [16], and many others. Among these methods, hierarchical clustering is a prominent approach that does not require any advance knowledge of the number of partitions of the network. There are two categories of hierarchical methods [9]: agglomerative algorithms, in which clusters emerge iteratively, and divisive algorithms, in which clusters split iteratively. The result is a dendrogram that encodes a trail of community partitions of the network, and the partition with the greatest modularity is reported as the community structure of the network. The GN algorithm, proposed by Girvan and Newman [6], is a typical divisive algorithm. By recursively removing the edge with the currently highest betweenness, the network is separated into subgroups. However, the complexity of the GN algorithm is not ideal; it runs in O(m²n) time on an arbitrary network with m edges and n vertexes, or O(n³) time on a sparse network [6]. Because most real-world networks of interest are sparse, with m ∼ n, the GN algorithm is not capable of handling large real-world networks. The seminal algorithm proposed by Clauset et al. [17] (the CNM algorithm, where CNM stands for the names of the authors) is an important agglomerative hierarchical algorithm that accommodates sophisticated data structures and scales to large networks. Its running time is O(md log n), where d is the depth of the dendrogram that describes the community structure. For a sparse and hierarchical network with m ∼ n and d ∼ log n, the complexity of this method is nearly linear, O(n log² n).
Later, Francisco and Oliveira [18] proposed an enhanced implementation of the CNM algorithm using improved data structures. This new implementation speeds up the CNM algorithm by at least a factor of two; its running time is also O(n log² n) for sparse and hierarchical networks. Before we proceed, we must point out that modularity suffers from some drawbacks. The most famous criticism is its resolution limit [19]: modularity optimization may fail to identify modules smaller than a scale that depends on the total size of the network and on the degree of interconnectedness of the modules, even in cases where modules are unambiguously defined. The performance is even worse for huge networks [20]. Other partition score functions that, like modularity, make random-graph assumptions about intermodule edges, such as the Potts model [21] and several likelihood-based techniques [22], also exhibit resolution limits [10]. Another score function, called modularity density [23] and based on link density, does not, in some examples [24], show the resolution limit that modularity suffers from. However, it suffers from a more serious limitation called misidentification [24]: the detected communities may have sparser connections within them than between them. These results imply that the output of any modularity maximization procedure should be interpreted cautiously in scientific contexts [10]. Stochastic algorithms that generate diverse results are therefore better suited for community detection. In this paper, we propose an agglomerative community detection algorithm with several alternative choices for random merging. We introduce a revised heap data structure with an extra head array to keep track of the potential modularity gains of community combinations.
Since the largest modularity increments and the corresponding community ids are stored in the head array, communities can be conveniently sampled for merging under different weighting schemes. When the size of the head array shrinks to zero, our algorithm becomes the original CNM algorithm, with no random sampling for pair merging. On the other hand, when the head array grows large enough to hold all entries, the proposed method degenerates to a totally random search, so that no optimization is guaranteed. A head array of moderate size ensures that the search path lies approximately along the gradient of the modularity function, while random perturbations help the search path escape locally optimal solutions. The resulting community structures typically attain higher modularity scores than those of the non-stochastic algorithms, and diverse community structures can be extracted by running the algorithm multiple times. This feature stems naturally from the stochastic nature of the sampling strategies. The rest of this paper is organized as follows. First, a brief review of the community detection problem and the original CNM algorithm is given in Section 2. The proposed stochastic search method is presented in Section 3. Simulation results and parameter analysis are reported in Section 4. Finally, conclusions and comments are drawn in Section 5.
2. Background

For the sake of convenience, we focus on simple networks with no repeated edges, in which all edges are undirected and unweighted. The community detection of directed [25] and weighted [26,27] networks can be treated similarly, and the details are omitted in this paper. A network of our concern is denoted as G = (V, E), where V = {V_1, ..., V_n} is the set of all n nodes or vertexes, and E = {E_e | E_e ∈ V × V, e = 1, ..., m} is the set of all m edges of the network. The adjacency matrix A of G indicates the edges:
A_ij = 1 if i and j are connected, i.e. there exists E_e ∈ E such that E_e = (V_i, V_j); A_ij = 0 otherwise.    (1)

More precisely, community detection is to find a disjoint partition C = {C_1, ..., C_c} of the vertex set V such that ∪_{k=1}^{c} C_k = V and C_i ∩ C_j = ∅ for all i ≠ j; that is, for each V_i ∈ V (1 ≤ i ≤ n) there exists exactly one C_k ∈ C (1 ≤ k ≤ c) such that V_i ∈ C_k. The modularity is a set function that evaluates the quality of the partition C of the vertexes of the network G [6]:
Q(C) = (1/2m) ∑_{k=1}^{c} ∑_{V_i ∈ C_k, V_j ∈ C_k} [ A_ij − k_i k_j / (2m) ].    (2)

Here k_i = ∑_{j=1}^{n} A_ij is the degree of node V_i. The term k_i k_j / (2m) is the expected number of links between V_i and V_j in a random network with the same degree sequence, as in the configuration model [28]. Thus, a partition with more edges within each C_k attains a higher modularity score. Because the brute-force search for the globally optimal partition has been proved to be NP-complete [13], many heuristic optimization methods have been proposed recently [9]. Among these methods, the CNM algorithm is the first agglomerative algorithm that accommodates sophisticated data structures and is capable of finding communities in very large networks with thousands of vertexes. The search path of an agglomerative algorithm such as the CNM starts from the assumption that each vertex belongs to a community containing only that vertex, C_(n) = {C_1, ..., C_n}, where C_i = {V_i} (1 ≤ i ≤ n). Here the subscript n indicates that there are n disjoint node sets, also known as communities, in C_(n). The algorithm generates a series of partitions C_(n), C_(n−1), ..., C_(1), which can be seen as a search path in the space of partitions. The partition C_m = arg max_{1≤k≤n} Q(C_(k)) is then returned by the algorithm as the community structure of the network G. In order to perform the search efficiently, the CNM algorithm [17,18] maintains a matrix ∆Q of the modularity increments produced by all possible community mergers. The elements of ∆Q are kept in a max-heap data structure H; see Fig. 1(a) for an illustration. The largest entry can thus be found in constant time, and deletion, insertion, and updating can each be performed in log(n) time thanks to the organization of the data structure [29]. At the very beginning the matrix is initialized as
∆Q_ij = A_ij / (2m) − k_i k_j / (4m²),    ∀ 1 ≤ i, j ≤ n.    (3)
The max-heap H is then populated with the non-zero elements of ∆Q. An auxiliary vector a = (a_1, ..., a_n) is initialized as

a_i = k_i / (2m),    ∀ 1 ≤ i ≤ n.    (4)
At each step, the largest entry of ∆Q, say ∆Q_ij, is selected; it can be drawn from the top of H in constant time. The community pair C_i and C_j in the current partition, say C_(k) with k disjoint communities, is then combined to form a new community C_i′ = C_i ∪ C_j. Thus a new partition C_(k−1) with k − 1 disjoint communities is obtained by deleting C_i and C_j from C_(k) and adding C_i′, i.e., C_(k−1) = C_(k) \ {C_i, C_j} ∪ {C_i′}. The matrix ∆Q and the vector a are updated as follows. If community k is connected to both i and j,
∆Q′_ik = ∆Q_ik + ∆Q_jk.    (5a)

If community k is connected to only one of the two communities, say i, then

∆Q′_ik = ∆Q_ik − a_j a_k.    (5b)
Then the j-th row and column of ∆Q are deleted from the max-heap H. All the above updating and deleting operations can be done in log(n) time, where n is the number of entries currently in H. Finally, update a_i′ = a_i + a_j and set a_j = 0. The member vertexes of C_i ∈ C_(k), 1 ≤ i ≤ k, 1 ≤ k ≤ n, are stored in a singly linked list [17], and the merging operation can be further improved by maintaining references among adjacency lists, leading to an enhanced version of the CNM algorithm [18]. It is claimed [18] that these improved community data structures speed up the method by a large factor for very large networks.
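The whole agglomerative pass can be sketched as follows (a toy, unoptimized Python version of our own, not the paper's implementation, which scans the gain dictionary instead of maintaining a heap; note that, under the normalization of Eq. (3), each ∆Q entry is half of the corresponding change in Q, so the sketch advances the running score by twice the selected gain):

```python
def greedy_merge(adj):
    """CNM-style greedy pass: repeatedly merge the community pair with the
    largest gain, applying the updates of Eqs. (5a)-(5b); returns the best
    partition met along the search path together with its modularity."""
    n = len(adj)
    k = [sum(row) for row in adj]
    m = sum(k) / 2
    a = {i: k[i] / (2 * m) for i in range(n)}
    nbr = {i: {j for j in range(n) if adj[i][j]} for i in range(n)}
    dq = {(i, j): 1 / (2 * m) - k[i] * k[j] / (4 * m * m)       # Eq. (3)
          for i in range(n) for j in range(i + 1, n) if adj[i][j]}
    comms = {i: {i} for i in range(n)}
    q = -sum(v * v for v in a.values())       # modularity of all-singletons
    best_q, best = q, [set(c) for c in comms.values()]
    while dq:
        (i, j), gain = max(dq.items(), key=lambda kv: kv[1])
        q += 2 * gain                          # dq holds half the true change
        for x in (nbr[i] | nbr[j]) - {i, j}:   # update entries touching i or j
            key = (min(i, x), max(i, x))
            jkey = (min(j, x), max(j, x))
            if x in nbr[i] and x in nbr[j]:
                dq[key] += dq[jkey]            # Eq. (5a)
            elif x in nbr[i]:
                dq[key] -= a[j] * a[x]         # Eq. (5b)
            else:
                dq[key] = dq[jkey] - a[i] * a[x]
        for x in nbr[j]:                       # drop row/column j
            dq.pop((min(j, x), max(j, x)), None)
            nbr[x].discard(j)
            if x != i:
                nbr[x].add(i)
        nbr[i] = (nbr[i] | nbr[j]) - {i, j}
        del nbr[j]
        a[i] += a[j]
        del a[j]
        comms[i] |= comms.pop(j)
        if q > best_q:
            best_q, best = q, [set(c) for c in comms.values()]
    return best_q, best
```

On the two-triangle bridge graph this pass recovers the two triangles, the optimal split for that graph.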
Fig. 1. (Color online) A sketch of the heap data structures. (a) The max-heap used by the CNM algorithm. Entries of ∆Q are arranged in a complete binary tree; the entry hosted by the root corresponds to the largest modularity increment, and each sub-root hosts the largest value in its sub-tree. (b) A max-heap with an extra head array of size 2 that hosts the two most significant modularity increments. The head array is constructed by popping the top element of the max-heap twice, which can be done in log₂ n time in the worst case, where n is the current number of elements in the max-heap; in this example n = 10.
3. Method

The key to the CNM algorithm is the adoption of the heap data structure, which enables high-speed community detection. A heap is a complete binary tree [29] that allows efficient implementations of the operations frequently used during modularity optimization, including maximum query, insertion, updating, and deletion. Fig. 1(a) shows a simple max-heap data structure, where the element located at the root is the largest and, for any sub-tree, the sub-tree's root hosts its largest element. Since the depth of the complete binary tree implementing the max-heap is at most ⌊log₂ n⌋ + 1, all these operations can be done in log₂ n time in the worst case [29]. More formally, we propose a revised data container, the head-heap data structure, with head size d, in which the d largest elements of the container are hosted by the head array and the remaining n − d elements are organized as a max-heap. As depicted in Fig. 1(b), the elements in the head array correspond to the most significant modularity increments (1.9 and 1.6, respectively), and the rest are hosted by an ordinary max-heap, the same data structure used by the CNM algorithm [17,18]. All the other information, such as the community ids and adjacency relationships, is also stored in the elements of the data container and can be retrieved instantly when needed. Our algorithm behaves in the same way as the CNM algorithm except that we store the non-zero entries of ∆Q in the proposed head-heap and pick community pairs randomly from the head array. Suppose that there are d entries of ∆Q in the head array, whose modularity increments are q_i, i = 1, ..., d. We consider the following sampling methods for popping elements out of the head-heap to decide which pair of communities to merge:

1. Choose the pair of communities corresponding to the maximum modularity increment, denoted as max. That is, we do the same as the CNM algorithm.
2. Choose the pair of communities corresponding to q_i with equal probability 1/d, denoted as random.

3. Choose the pair of communities corresponding to q_i with probability q_i / ∑_{j=1}^{d} q_j, denoted as weight.

4. Choose the pair of communities corresponding to q_i with probability q_i² / ∑_{j=1}^{d} q_j², denoted as square.

5. Choose the pair of communities corresponding to q_i with probability q_i⁻¹ / ∑_{j=1}^{d} q_j⁻¹, denoted as inverse.

6. Choose the pair of communities corresponding to q_i with probability q_i⁻² / ∑_{j=1}^{d} q_j⁻², denoted as inversesq.
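The six schemes can be written down directly; the sketch below (our naming, not from the paper) returns an index into the head array, assuming the increments q_i are positive, as they are through most of the agglomeration:

```python
import random

def sample_index(q, scheme, rng=random):
    """Pick a position in the head array q under the paper's six schemes.
    Assumes q_i > 0 for the four non-uniform weighted schemes."""
    if scheme == "max":
        return max(range(len(q)), key=q.__getitem__)
    weights = {
        "random":    [1.0] * len(q),
        "weight":    list(q),
        "square":    [x * x for x in q],
        "inverse":   [1.0 / x for x in q],
        "inversesq": [1.0 / (x * x) for x in q],
    }[scheme]
    return rng.choices(range(len(q)), weights=weights, k=1)[0]
```

For the head array of Fig. 1(b), q = [1.9, 1.6], the weight scheme picks index 0 with probability 1.9/3.5 ≈ 0.54.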
The head size d controls the level of randomness in the search path, and the sampling strategies listed above represent biased preferences over the communities to be merged. The larger the parameter d is, the more likely a pair of communities with a lower modularity increment will be merged. When the number of elements is smaller than d, say only k (k < d) elements are left after the previous sampling, all elements are stored in the head array and the sampling procedure is applied only to these k elements. For example, for the weight sampling we simply pick the pair of communities corresponding to q_i (1 ≤ i ≤ k) with probability q_i / ∑_{j=1}^{k} q_j. In this case, the max sampling method is not affected; it
just picks the pair of communities with the maximum modularity increment. The enhanced version [18] further deals with the scenarios where equal maxima exist by randomly choosing one of them. The other merging methods listed above make use of higher levels of randomness. The random method does not create any bias among the d pairs to choose from. The weight method prefers larger q_i, but not as much as the square method does. The inverse method prefers smaller q_i, but not as much as the inversesq method does. It may seem that the inverse and the inversesq methods disagree with the optimization goal, since they put more weight on the smaller modularity increments. But the sampling is restricted to the head array, which consists of the largest d modularity increments, and generally d is much smaller than the number of elements in the head-heap, except in the final stage when very few groups remain. So most samples will still choose one of the largest modularity increments in the head-heap, as can be seen easily from Fig. 2(e)–(f).

Fig. 2. (Color online) Max-heap popping. 10,000 uniform random numbers are stored in the head-heap with different head sizes (d = 20, 200, respectively), then popped off by the different sampling methods. (a) Maximum popping (max). (b) Random popping (random). (c) Weighted popping (weight). (d) Square weighted popping (square). (e) Inverse weighted popping (inverse). (f) Square inverse weighted popping (inversesq). Only the first 800 sampled numbers are reported because the long-range decreasing tendency makes the plots too slim to show details.

As an example we generate 10,000 random numbers uniformly distributed in the interval [0, 1] and insert them into two head-heaps with head sizes d = 20 and d = 200, respectively. Then we pop them out one by one using the sampling methods discussed above, and the first 800 of them are shown in Fig. 2. The plots show that the numbers popped using the different methods experience a long-range decreasing tendency, and the max curve in Fig. 2(a) decreases most sharply. Since max always pops the largest number, the head size d makes no difference for it. The other five sampling methods produce more spikes; the more weight assigned to the small values, the sharper the spikes (ranging from (c) to (f)). The decreasing tendency is easily explained: the numbers in the head array are the largest ones in the head-heap, and after any number in the head array is popped, a smaller number extracted from the heap replaces it, so the average of the numbers in the head array decreases over time. Another observation is that the bigger the head array is, the more fluctuations appear in the popped numbers. Since a bigger head array means that more of the smaller numbers can potentially be sampled, a smaller d gives a more conservative community search and a bigger d a more aggressive one. The parameter d controls the randomness of the head-heap popping.
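A minimal sketch of such a head-heap container (our implementation, using Python's heapq min-heap with negated values to emulate a max-heap; in the real algorithm the stored items would be (gain, community-pair) tuples rather than bare numbers):

```python
import heapq

class HeadHeap:
    """Max-heap whose d largest elements are kept in a separate head array."""
    def __init__(self, items, d):
        self.d = d
        self.heap = [-x for x in items]      # negate values: heapq is a min-heap
        heapq.heapify(self.heap)
        self.head = []
        self._refill()

    def _refill(self):
        # move the heap maximum into the head until the head holds d elements
        while len(self.head) < self.d and self.heap:
            self.head.append(-heapq.heappop(self.heap))

    def push(self, x):
        if len(self.head) < self.d:
            self.head.append(x)
            return
        lo = min(self.head)
        if x > lo:                           # keep the d largest in the head
            self.head[self.head.index(lo)] = x
            x = lo
        heapq.heappush(self.heap, -x)

    def pop_at(self, idx):
        """Remove the idx-th head element (chosen by a sampling scheme)."""
        x = self.head.pop(idx)
        self._refill()
        return x
```

Popping head positions chosen by the sampling schemes of this section then reproduces the qualitative behavior of Fig. 2: the popped values decrease over the long run, with spikes whose size grows with d and with the weight put on small increments.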
For example, when d = 20, the first sample is restricted to the 20 largest numbers in the head-heap, the second sample to the 21 largest, and so on; in general, the k-th sample is restricted to the d + k − 1 largest numbers. In Fig. 2 we thus expect the vertical values to range from 1 down to about 0.9, since the numbers are generated uniformly in [0, 1] and the lower bound for the last sampled number is approximately (10,000 − (200 + 800 − 1))/10,000 ≈ 0.9. Note that randomness may arise even if d = 0, due to the possible existence of equal maxima in ∆Q. When there are two equal modularity increments, the enhanced version of the CNM algorithm [18] applies a uniform random sampling. However, this random sampling is only a simple remedy for the situation of equal modularity increments. This operation can be implemented by randomly shuffling the equal elements in the head-heap during insertion, deletion, and updating. So the max sampling method proposed here is actually the CNM algorithm [17] with the added shuffling of equal elements. We summarize our method in Fig. 3, where a tuple is used to record the connection between two communities and the corresponding modularity gain from merging the pair. The sampling method (one of the six options above) is provided as an input by the user, and the head-heap implements it internally. After a popping operation, a pair of communities is merged and the community id is passed to the head-heap to update the matrix ∆Q stored inside according to Eqs. (5a) and (5b). The auxiliary vector a is updated as well, which is omitted in Fig. 3.

4. Results

In this section, we evaluate the proposed method on synthetic benchmark networks and on real-world networks.
Fig. 3. Pseudo code of the proposed method.
Fig. 4. (Color online) Examples of benchmark networks [30] with different mixing ratios. Parameters: n = 256, kmean = 8, kmax = 15, β = 2.5, γ = 1.5, smin = 32, smax = 48, and the mixing ratio (a) µ = 0.05, (b) µ = 0.30, (c) µ = 0.55. Here the number of nodes, n = 256, is kept small for visualization purposes.
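The truncated power-law sampling of degrees and community sizes behind such benchmarks can be sketched as follows (illustrative only, with the parameters of Fig. 4; the minimum degree of 4 is our hypothetical choice, since only kmean and kmax are listed, and the full LFR construction, including the rewiring that realizes the mixing ratio µ, is more involved [30]):

```python
import random

def truncated_power_law(exponent, lo, hi, rng):
    """Draw an integer s in [lo, hi] with probability proportional to s^(-exponent)."""
    support = range(lo, hi + 1)
    weights = [s ** (-exponent) for s in support]
    return rng.choices(support, weights=weights, k=1)[0]

rng = random.Random(42)
# degrees with exponent beta = 2.5, capped at kmax = 15 (hypothetical kmin = 4)
degrees = [truncated_power_law(2.5, 4, 15, rng) for _ in range(256)]
# community sizes with exponent gamma = 1.5 in [smin, smax] until all nodes fit
sizes = []
while sum(sizes) < 256:
    sizes.append(truncated_power_law(1.5, 32, 48, rng))
```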
4.1. Tests on computer-generated networks

We apply the proposed algorithm to the class of computer-generated benchmark networks proposed by Lancichinetti et al. [30]. This class contains networks with heterogeneous distributions of node degree and community size; these networks therefore pose a much more severe test to community detection algorithms than Newman's standard benchmark [31]. These networks have a known community structure C_o that is constructed as follows. Both the degree and the community size distributions follow power laws, with exponents β and γ, respectively. The number of nodes is n, the average degree is k_mean, the maximum degree is k_max and the minimum degree is k_min; the simulated degrees are drawn from the power-law distribution within this range. The minimum community size is s_min and the maximum community size is s_max; like the degrees, the community sizes are drawn from the power-law distribution within this range. Each node shares a fraction 1 − µ of its links with the other nodes of its community, where µ (0 < µ < 1) is the mixing parameter. A bigger µ means fewer links within communities and more links between them, leading to a fuzzier community structure. Fig. 4 shows three networks with different mixing parameters. As the mixing parameter µ increases from 0.05 to 0.55 in steps of 0.25, the community structures of the simulated networks become weaker and weaker, posing greater challenges to the community-finding algorithm. Denote the community structure detected by an algorithm as C_e. The normalized mutual information (NMI) [32] then measures the degree of similarity between the true and the estimated community structures. The definition of NMI
Fig. 5. (Color online) Comparison between the normalized mutual information of our proposed algorithm and the enhanced CNM algorithm [18] on the benchmark networks [30]. The head size is d = 10. The algorithms are repeated 100 times and the medians of their NMI scores are plotted. See the main text for the parameter settings.
between C_o and C_e is given by

NMI(C_o, C_e) = [ H(C_o) + H(C_e) − H(C_o, C_e) ] / √( H(C_o) H(C_e) ).    (6)

Here H(C) is the Shannon entropy [33] of C = {C_1, ..., C_c}, H(C) = −∑_{k=1}^{c} p_k log p_k, with p_k = n_k / n, where n_k is the number of nodes in partition C_k (1 ≤ k ≤ c) and n = ∑_{k=1}^{c} n_k is the total number of nodes in G. H(C_o, C_e) is the joint entropy of C_o = {C_o1, ..., C_oc_o} and C_e = {C_e1, ..., C_ec_e}, H(C_o, C_e) = −∑_{k=1}^{c_o} ∑_{l=1}^{c_e} p_kl log p_kl, with p_kl = n_kl / n, where n_kl is the number of nodes in C_ok ∩ C_el (1 ≤ k ≤ c_o, 1 ≤ l ≤ c_e). The NMI is a typical supervised metric that measures how well the ''found'' communities reflect the true communities. From Eq. (6) we can deduce that NMI(C_o, C_e) ∈ [0, 1], and if C_e is exactly the same as C_o, meaning that all nodes are grouped correctly, then NMI(C_o, C_e) = 1. We generate 14 different networks with µ = 0.05, 0.10, ..., 0.70, and the other parameters are fixed as follows: n = 1024, k_mean = 32, k_max = 64, β = 2.5, γ = 1.5, s_min = 64, s_max = 256. The head size of the head-heap is set to 10, i.e. d = 10, and we repeat all the community finding algorithms 100 times and report the medians of the NMI scores. The median of a set of numbers is its middle point, with half of the numbers bigger and half smaller than it. The results are shown in Fig. 5. When the mixing ratio µ ≤ 0.2, all the sampling methods recover the community structure exactly, since the community structures are very strong. As µ increases and the community structure becomes fuzzier, the NMI scores decline gradually. The curve of the max algorithm is dominated by the curves of the other five sampling methods; all five outperform the max method for 0.25 ≤ µ ≤ 0.55. There are sharp drops as µ varies from 0.55 to 0.60, and at µ = 0.60 the random sampling method performs best, with a narrow gap ahead of the other four sampling methods, i.e. weight, square, inverse, and inversesq.
When µ = 0.7, all the NMIs are approximately zero, so no further simulations are needed for greater µ. In such cases there are only small deviations between the five proposed sampling methods; even so, we can still deduce that the weight and the square methods perform better than the others. However, the head size d should not be set to too big a value. We increase d from 20 to 80 and repeat the optimization procedure 100 times while keeping the networks unchanged. The results are shown in Fig. 6, from which we can see that among the sampling methods the max method performs best when µ = 0.5 and µ = 0.55. It is unclear why max performs best in these cases; one possible explanation is that the networks are generated randomly and the results stem from random fluctuations. We can see, though, that a bigger head size d introduces more randomness into the modularity optimization procedure. Another observation is that the inversesq method tends to perform worse as d gets bigger. This phenomenon tells us that merging the communities with the bigger modularity increments is not only a heuristic way to optimize modularity, but also an effective way to find the ''natural'' community structure in networks.
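Eq. (6) is straightforward to evaluate; a compact sketch (our code, with partitions encoded as per-node label lists, natural logarithms, and the convention NMI = 0 when one partition is a single community):

```python
from collections import Counter
from math import log, sqrt

def nmi(labels_a, labels_b):
    """Eq. (6): NMI = (H(Co) + H(Ce) - H(Co, Ce)) / sqrt(H(Co) * H(Ce))."""
    n = len(labels_a)
    def entropy(counts):
        return -sum(c / n * log(c / n) for c in counts)
    h_a = entropy(Counter(labels_a).values())
    h_b = entropy(Counter(labels_b).values())
    h_ab = entropy(Counter(zip(labels_a, labels_b)).values())  # joint entropy
    if h_a == 0.0 or h_b == 0.0:
        return 0.0                   # degenerate: a partition with one community
    return (h_a + h_b - h_ab) / sqrt(h_a * h_b)
```

Identical partitions score 1 regardless of how the labels are named; independent ones score 0.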
4.2. Tests on the karate network

The karate club network [34] is the most frequently used benchmark network in community analysis [9,14,31,35]. It consists of 34 nodes and 78 edges, and characterizes the social interactions between the individuals in the karate club
Fig. 6. (Color online) The NMI scores for different head sizes d: (a) d = 20, (b) d = 40, (c) d = 80. These networks are the same as the ones in the previous experiment. All algorithms are repeated 100 times and the medians of the NMIs are plotted versus the mixing ratio µ.
Fig. 7. (Color online) The karate network that is split into four communities with modularity 0.419 obtained by setting the head size d = 10.
at an American university. The optimal modularity for the karate network is 0.42 [36], which splits the network into four disjoint communities, shown in different colors in Fig. 7. In order to see the differences caused by the head size we set d = 2, 5, 10, respectively, and repeat each algorithm 100 times. For example, by repeating the weight method 100 times with head size d = 2 we get 100 modularity scores: m_1 = 0.38, m_2 = 0.41, ..., m_100 = 0.39. The minimum of the m_i is m_min = min{m_i | 1 ≤ i ≤ 100} = 0.38, the maximum is m_max = max{m_i | 1 ≤ i ≤ 100} = 0.42, and the median is m_median = 0.38, which satisfies #{m_i | m_i ≤ m_median} = #{m_i | m_i ≥ m_median} = 50. Here #A means the number of elements in the set A. The standard deviation (s.d.) of the m_i is

m_sd = √( (1/99) ∑_{i=1}^{100} (m_i − m̄)² ) = 0.01,

where m̄ is the average modularity, m̄ = (1/100) ∑_{i=1}^{100} m_i. The statistics, i.e. min,
max, median and s.d., of the modularity scores for all six methods with different head array sizes are shown in Table 1. Since the karate club is a small network, the max method always returns the same community structure as the original CNM algorithm would [17], leading to zero standard deviations; later we will see that max returns different community structures for bigger networks. One observation is that almost all random sampling methods outperform the max method, the exception being the inversesq method with d = 5 and d = 10, as indicated by their medians. The median is the fiftieth percentile, or second quartile, of a random variable, so a method whose median modularity exceeds that of the CNM algorithm outperforms the CNM algorithm, on average, in terms of modularity optimization. Another merit of the random sampling methods is that they can report several alternative community structures, especially for large networks, while the CNM algorithm reports only one deterministic community structure. The modularity scores obtained by inverse and inversesq exhibit intense fluctuations. When d = 2, the minimum modularity scores reported by inverse and inversesq are 0.37 and 0.35, respectively, much smaller than the value of 0.38 produced by the other four methods. The weight and inversesq methods reach the optimal modularity 0.42, the biggest modularity for the karate network reported in the literature [31]. When d = 5, more freedom is introduced into the community merging operation than when d = 2, and all five random sampling methods return modularity scores within broader ranges. In particular, the inversesq method gives the
Table 1
Statistics of modularity scores for the karate network by different algorithms. All methods are repeated 100 times. The min, max, median and standard deviation (s.d.) of the modularity scores are reported. See text for details.

Head size | Statistics | max  | random | weight | square | inverse | inversesq
d = 2     | min        | 0.38 | 0.38   | 0.38   | 0.38   | 0.37    | 0.35
          | max        | 0.38 | 0.40   | 0.42   | 0.40   | 0.40    | 0.42
          | median     | 0.38 | 0.39   | 0.38   | 0.38   | 0.38    | 0.38
          | s.d.       | 0.00 | 0.01   | 0.01   | 0.01   | 0.01    | 0.01
d = 5     | min        | 0.38 | 0.37   | 0.37   | 0.37   | 0.37    | 0.32
          | max        | 0.38 | 0.42   | 0.42   | 0.42   | 0.42    | 0.42
          | median     | 0.38 | 0.39   | 0.39   | 0.39   | 0.39    | 0.37
          | s.d.       | 0.00 | 0.01   | 0.01   | 0.01   | 0.01    | 0.02
d = 10    | min        | 0.38 | 0.32   | 0.37   | 0.37   | 0.36    | 0.27
          | max        | 0.38 | 0.42   | 0.42   | 0.42   | 0.42    | 0.42
          | median     | 0.38 | 0.39   | 0.39   | 0.39   | 0.39    | 0.34
          | s.d.       | 0.00 | 0.01   | 0.01   | 0.01   | 0.01    | 0.03
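The per-method statistics reported in Table 1 are simple summaries of the 100 repeated modularity scores. A minimal sketch of how they can be computed (the five scores below are hypothetical, for illustration only; the sample standard deviation uses the 1/(n − 1) normalization from the formula above):

```python
import statistics

def summarize(scores):
    """Summarize repeated-run modularity scores as in Table 1:
    min, max, median, and sample standard deviation."""
    return {
        "min": min(scores),
        "max": max(scores),
        "median": statistics.median(scores),
        "s.d.": statistics.stdev(scores),  # divisor n-1, matching the text
    }

# Hypothetical scores from 5 repeated runs (illustration only).
scores = [0.38, 0.41, 0.39, 0.38, 0.42]
print({k: round(v, 2) for k, v in summarize(scores).items()})
```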
Table 2
Simple descriptions of the networks used in experiments.

Abbreviation | Reference | Nodes   | Edges     | Remarks
power grid   | [37]      | 4,941   | 6,594     | The American western states power grid network.
hep          | [38]      | 5,835   | 13,815    | The high-energy theory collaboration network.
pgp          | [39]      | 10,680  | 24,316    | The network of users of the Pretty-Good-Privacy algorithm.
condmat      | [38]      | 36,458  | 171,736   | The condensed matter collaboration network.
www          | [40]      | 325,729 | 1,090,108 | Web pages in the nd.edu domain.
minimum 0.32 and maximum 0.42, with the largest standard deviation, 0.02. The other four random sampling methods give minimum 0.37 and maximum 0.42. Although this minimum is smaller than the 0.38 obtained by the max method, the medians of these four methods are larger, which means that these sampling methods obtain bigger modularity scores than max in most situations. Here "most situations" means more than fifty percent of the repeated runs of the sampling methods; in other words, when running a sampling algorithm twice, the larger of the two modularity scores tends to exceed the score obtained by the max method. When d = 10, the standard deviation is much bigger than for d = 2 and d = 5. More specifically, the standard deviation of the modularity scores obtained by inversesq is three times that of the other four sampling methods. The minimum modularity scores obtained by these four sampling methods are much smaller than that produced by the max method, but their maximum modularity scores still equal the optimal modularity, and their median modularity scores are still bigger than the median of the max method. To sum up, random sampling with a positive head size tends to produce larger modularity scores than the CNM algorithm on the karate network, while inversesq produces the modularity scores with the most significant variation; the scores of the other four sampling methods lie in between.

4.3. Tests on other real-world networks

In this subsection we test our algorithm on real-world networks. The basic information of these networks is shown in Table 2.
Compared to the karate network, these networks have more nodes and edges, and thus pose a much more serious challenge to community analysis algorithms. Here we briefly describe these networks; for details please see the relevant references. The power grid network is an undirected, unweighted representation of the topology of the Western States Power Grid of the United States. The hep network is the collaboration network of scientists posting preprints on the high-energy theory archive at www.arxiv.org during the time span 1995–1999, as compiled by M.E.J. Newman. The network is weighted, with weights assigned as described in the original papers [41,42]. The pgp network is the connected giant component of the network of users of the Pretty-Good-Privacy algorithm for secure information interchange [39]. The condmat network is the collaboration network of scientists posting preprints on the condensed matter archive at www.arxiv.org between January 1, 1995, and March 31, 2005 [38]. The www network is a subset of the world wide web consisting of 325,729 web pages within the nd.edu domain and 1,090,108 hyperlinks between them [40]. Since the algorithm proposed here is stochastic in nature, we repeat each community finding procedure 100 times and report the statistics of the modularity scores in Table 3. For all networks and all sampling methods the parameter d is set to 10, which provides enough flexibility in the community merging, as described in Section 2. We have also experimented with other values of d, with results similar to those of the previous subsection; due to space limitations we do not report these experiments here. One observation is that, as the network size grows, the max method can return
Table 3
Statistics of modularity scores of real-world networks produced by different algorithms. Each method is repeated 100 times. The min, max, median, and standard deviation (s.d.) of the modularity scores are reported. The head size d is set to 10. The basic information of these networks is shown in Table 2. See the text for details.

Networks   | Statistics | max     | random  | weight  | square  | inverse | inversesq
power grid | min        | 0.9313  | 0.9285  | 0.9310  | 0.9290  | 0.9282  | 0.9099
           | max        | 0.9358  | 0.9341  | 0.9359  | 0.9354  | 0.9341  | 0.9232
           | median     | 0.9337  | 0.9314  | 0.9334  | 0.9324  | 0.9317  | 0.9151
           | s.d.       | 0.0010  | 0.0011  | 0.0011  | 0.0012  | 0.0013  | 0.0028
hep        | min        | 0.7565  | 0.7676  | 0.7658  | 0.7646  | 0.7664  | 0.7362
           | max        | 0.7908  | 0.7890  | 0.7889  | 0.7933  | 0.7880  | 0.7709
           | median     | 0.7811  | 0.7822  | 0.7834  | 0.7836  | 0.7806  | 0.7612
           | s.d.       | 0.0080  | 0.0040  | 0.0039  | 0.0056  | 0.0035  | 0.0056
pgp        | min        | 0.8419  | 0.8129  | 0.8206  | 0.8181  | 0.7841  | 0.7494
           | max        | 0.8533  | 0.8540  | 0.8541  | 0.8537  | 0.8510  | 0.8088
           | median     | 0.8445  | 0.8433  | 0.8446  | 0.8458  | 0.8351  | 0.7850
           | s.d.       | 0.0031  | 0.0078  | 0.0067  | 0.0059  | 0.0119  | 0.0116
condmat    | min        | 0.6032  | 0.5969  | 0.5980  | 0.5982  | 0.5932  | 0.6013
           | max        | 0.6494  | 0.6489  | 0.6480  | 0.6499  | 0.6486  | 0.6475
           | median     | 0.6306  | 0.6310  | 0.6308  | 0.6320  | 0.6319  | 0.6311
           | s.d.       | 0.0104  | 0.0152  | 0.0142  | 0.0143  | 0.0157  | 0.0130
www        | min        | 0.9266  | 0.9265  | 0.9265  | 0.9267  | 0.9264  | 0.9135
           | max        | 0.9274  | 0.9279  | 0.9278  | 0.9279  | 0.9278  | 0.9265
           | median     | 0.9273  | 0.9272  | 0.9273  | 0.9273  | 0.9272  | 0.9244
           | s.d.       | 0.00017 | 0.00022 | 0.00024 | 0.00020 | 0.00025 | 0.00244
different community structures, as indicated by the non-zero standard deviations. Because these networks are much bigger than the karate network, which consists of only 34 nodes, this result provides evidence that equal modularity increments occur during the community merging procedure. The CNM algorithm blindly chooses one of the equal modularity increments without an explicit sampling procedure, and different runs of CNM report the same deterministic community structure, except under renumberings of the network nodes. Each renumbering of the nodes can be seen as one possible implicit sampling among the community pairs with equal modularity increments. This potential renumbering is fixed in the early stages of the network data analysis, possibly before community detection, so it is difficult to tell whether such a renumbering has actually taken place. The max algorithm, by contrast, provides a systematic way to handle community pairs with equal modularity increments. The other five sampling methods, i.e. random, weight, square, inverse, and inversesq, introduce additional flexibility into the community merging. Though this flexibility comes at the price of the extra head array in the head-heap, the returned community structures tend to show better modularity scores. Another observation is that the weight and square sampling methods produce the most significant modularity scores. For the power grid network, the weight method gives the biggest modularity value, 0.9359, while max produces the biggest median. For the hep, www, and condmat networks, the square method leads to both the biggest modularity score and the biggest median. And for the pgp network, weight gives the biggest modularity and square reports the biggest median.
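The five stochastic strategies differ only in the probability assigned to each entry of the head array. The exact weighting formulas are given in Section 2 of the paper; the sketch below uses plausible definitions consistent with the qualitative behavior described here (weights proportional to ΔQ, ΔQ², 1/ΔQ and 1/ΔQ², with the increments in the head array assumed positive), and is an illustration rather than the authors' implementation:

```python
import random

# Plausible per-strategy weights over the head-array increments dq
# (hypothetical definitions; see Section 2 of the paper for the originals).
WEIGHTS = {
    "max":       lambda dq: [1.0 if x == max(dq) else 0.0 for x in dq],
    "random":    lambda dq: [1.0] * len(dq),
    "weight":    lambda dq: list(dq),                 # proportional to dQ
    "square":    lambda dq: [x * x for x in dq],      # proportional to dQ^2
    "inverse":   lambda dq: [1.0 / x for x in dq],    # proportional to 1/dQ
    "inversesq": lambda dq: [1.0 / (x * x) for x in dq],
}

def sample_pair(dq, method, rng=random):
    """Pick the index of the community pair to merge next."""
    return rng.choices(range(len(dq)), weights=WEIGHTS[method](dq), k=1)[0]

dq = [0.05, 0.05, 0.03, 0.01]   # hypothetical head-array increments
print(sample_pair(dq, "max"))   # one of the two maximal entries (index 0 or 1)
```

Note that under "max", ties among maximal increments are broken at random here, which mirrors the implicit tie-breaking discussed above.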
The lower bound of the modularity scores returned by the max method is by design the largest, since max has the smallest range of choices during community merging; this can be seen clearly from the results in Table 3, where max reports the biggest min for the power grid, pgp, and condmat networks. Unlike the inverse and inversesq methods, which tend to merge pairs of communities with small modularity increments within the head array, the weight and square methods put more weight on the pairs with bigger modularity increments.

4.4. Scalability

In this subsection we test the scalability of our algorithm on computer-generated networks with up to approximately one million nodes. We vary the head size d of the head-heap to see whether the introduction of the extra head array significantly increases the computing time. Generally speaking, all six sampling methods, i.e. max, random, weight, square, inverse, and inversesq, employ the head-heap data structure to sample communities for possible merges. The most time-consuming operations are the building and popping of the head-heap, which keeps track of the modularity increments of all potential community merges. Thus these methods have the same complexity of O(n log² n) for sparse networks. Here we only show the running times of the weight method with d = 10, 100, and 1000, respectively; the running times of the other five methods are similar. We also experiment with the original CNM algorithm, with the code downloaded directly from Clauset's web site.¹ The implementation of the proposed methods is adapted from the code downloaded from the kdbio website² and is quite efficient. All experiments in this subsection were conducted on one

¹ http://www.cs.unm.edu/aaron/research/fastmodularity.htm.
² http://kdbio.inesc-id.pt/software/gcf/.
Fig. 8. (Color online) Running times of different algorithms: the original CNM algorithm and the weight algorithm with head array sizes d = 10, 100, and 1000, respectively. The horizontal axis is log-scaled and indexes the number of nodes of the benchmark networks. The weight algorithm is much faster than the original CNM code, even with the extra head array size d = 1000, especially for large networks.
Table 4
Running times (in seconds) of different methods.

Network size | Weight (d = 10) | Weight (d = 100) | Weight (d = 1000) | CNM
1,000        | 0.01            | 0.02             | 0.12              | 0.04
2,000        | 0.01            | 0.03             | 0.22              | 0.08
4,000        | 0.04            | 0.05             | 0.43              | 0.24
8,000        | 0.06            | 0.12             | 0.69              | 0.46
16,000       | 0.20            | 0.28             | 1.37              | 1.38
32,000       | 0.72            | 0.85             | 2.88              | 4.47
64,000       | 2.57            | 3.29             | 7.24              | 15.87
128,000      | 15.11           | 21.97            | 31.56             | 88.01
256,000      | 124.31          | 135.89           | 161.91            | 578.51
512,000      | 1130.57         | 1342.47          | 1438.54           | 1700.12
1,024,000    | 1317.21         | 1880.26          | 2112.65           | 7446.11
of the authors' laptops, with a 2.27 GHz Intel i5 M430 processor and 2 GB of DDR3-1066 SDRAM, running Ubuntu 9.04 with Linux kernel 2.6.31-22 and GNU gcc 4.4.1. We generate 11 benchmark networks [30] with the number of nodes ranging from 1k to 1024k, i.e. we set n = 2^c × 1000 with c = 0, 1, . . . , 10; the other parameters are kmean = 3, kmax = 100, β = 2.5, γ = 1.5, smin = 32, smax = 64, and µ = 0.35. To our surprise, the network generation procedure takes a long time for large networks: in our experiments it takes about 5 h to generate the benchmark network [30] with 1.024 million nodes and 2.122 million edges, so we do not conduct experiments on networks with more than 1.024 million nodes. Another reason for choosing this upper limit is that the original CNM algorithm takes too long to finish on larger networks; in our experiment it takes about 2 h and about 560 MB of memory to find the community structure of the benchmark network with 1.024 million nodes, whereas our method uses only about 145 MB. Fig. 8 and Table 4 show the running times of the different algorithms. When the number of nodes is smaller than 32,000, all methods produce their outcomes within 5 s, which means that all of them are impressively fast. For small networks the additional head array noticeably slows down the algorithm, as can be seen from the top five rows of Table 4; since the absolute running times are quite small, this delay is bearable. As the network size grows, the relative delay disappears, and the sampling algorithm with d = 1000 becomes faster than the original CNM algorithm once the network size reaches 16k. The gap in running time between weight and the original CNM algorithm grows much wider for bigger networks: on the benchmark network with 1.024 million nodes, the fastest version of the weight method runs in about 22 min and the slowest in about 35 min, while the original CNM algorithm takes about 2 h.
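For intuition, the O(n log² n) bound predicts how the running time should scale as the network doubles in size, ignoring constant factors and the memory-hierarchy effects that clearly dominate the measured times for the largest networks above:

```python
from math import log2

def cost(n):
    # Idealized O(n * log^2 n) cost model; constant factors ignored.
    return n * log2(n) ** 2

# Predicted slowdown when doubling from 512k to 1024k nodes: about 2.2x.
print(cost(1_024_000) / cost(512_000))
```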
Two observations can be made from this experiment. First, the enhanced CNM algorithm [18] is at least 5 times faster than the original method. Second, for large networks our proposed method is about 2 times slower than the enhanced CNM algorithm, or about 2 times faster than the original CNM algorithm, even when the head array size is set to d = 1000. The only purpose of setting d = 1000 is to assess its influence on the running time of our proposed algorithm; in practical situations one does not need such a large head array, and d = 10 seems large enough for the purpose of community detection. Thus the proposed algorithms run nearly as fast as the enhanced CNM algorithm [18], and much faster than the original implementation [17].
Fig. 9. (Color online) Diverse community structures with different modularity scores detected by the proposed algorithm, using the weight sampling strategy with head size d = 10. Partitions are indicated by colors and the modularity scores are shown as the titles of the subplots.
4.5. Diversity

In this subsection we take a closer look at the communities detected by our methods. The stochastic nature of the proposed algorithms makes them capable of generating diverse community structures, since multiple runs of a random algorithm may produce different results, especially when the objective function is degenerate. As pointed out in Ref. [10], the modularity function is not only highly degenerate but also lacks a clear global optimum. Thus, the stochastic methods proposed here may be used to explore the plateau region of the modularity function. To make the problem clear, we again use the seminal karate network [34] to demonstrate this idea. We show the communities detected by the weight method with head size d = 10 in Fig. 9; other sampling strategies generate similar results, which we omit here. The modularity scores of the partitions range from 0.391519 to 0.419790, and the corresponding community structures are shown in Fig. 9(a)–(i). Here we only plot community structures with modularity greater than 0.38, which is the modularity score returned by the CNM algorithm. Repeating the weight sampling method 100 times, we obtain about 20 different modularity scores meeting this condition; due to space limitations we report only 9 of them in Fig. 9, the rest being much like the ones shown. For Q = 0.391519, Q = 0.394888 and Q = 0.402038 we get three communities, see Fig. 9(a), (c) and (e); for the other values of Q we get four communities. The only difference between Fig. 9(a) and (c) is the membership of V24, which has three links to the community in the lower left corner and two links to the partition on the right-hand side. We can also see that the lower left community is denser than the right one. Another observation is that V24 belongs to the right group in Fig. 9(i) with Q = 0.419790, which is larger than the Q = 0.394888 obtained when V24 belongs to the lower left community. So both memberships of V24 make sense.
The community structure revealed by Fig. 9(e) is quite different from the above two: the right group is merged with the lower left one, and the upper one is split into two. This great difference cannot be spotted from modularity alone, since there are only small gaps between the modularity scores of (a), (c) and (e). Consider Fig. 9(c) and (d): the modularity score increases from 0.394888 to 0.395135, a difference of less than 1%, yet there are large variations in the corresponding community structures: 6 of the 34 vertices, i.e. 17.6% of the total, change their memberships. The
6 nodes are V3, V6, V7, V10, V17 and V24. Fig. 9(b) and (f) give another example: the modularity score increases from 0.392012 to 0.402285, an increment of only about 0.01, but the number of partitions changes from 3 to 4. A bigger modularity does not necessarily lead to more partitions: in Fig. 9(d) and (e), the modularity score increases from 0.395135 to 0.402038, again an increment of less than 0.01, but the number of partitions is reduced from 4 to 3. For Fig. 9(g)–(i), on the other hand, tiny differences in modularity lead to nearly identical partitions; only V24 and V10 change their memberships. To sum up, a tiny gap in modularity scores may correspond to a big difference in community structures, but small changes are possible too. Thus, diversity naturally exists, and modularity by itself cannot reveal variations of community structures. This agrees with the result of Good et al. [10], who show that the modularity function is highly degenerate and lacks a clear global optimum. Since the proposed algorithm is very fast, one may run it multiple times to obtain diverse community structures.

5. Conclusions

In this work we tackle the problem of community detection in complex networks and propose a series of random sampling methods to uncover community structures with high modularity scores. A new data structure called the head-heap is proposed to support the sampling of community pairs for merging. This data structure consists of a max-heap and a head array. The construction and popping of the head-heap can be done in O(log n) time, so the proposed community finding algorithms run in O(n log² n) time, the same as the CNM algorithm. Our method inherits the efficiency of the enhanced CNM algorithm; the head array slows the algorithm down only a little, but provides great diversity in the discovered high-modularity community structures, owing to the degeneracy of the modularity scores that guide the community finding procedure. Here we also proposed six simple random sampling methods, i.e.
max, random, weight, square, inverse, and inversesq, to sample the community pairs by indexing the head array of the head-heap. Additional sampling schemes, such as MCMC [43] and simulated annealing [44], and different weighting methods can be introduced into the sampling operation in future work. Another issue is the head array size d, which could be varied according to the total number of elements in the max-heap during the community finding procedure, especially at the final stage, in order to keep the randomness in community merging under better control. Currently, our methods can only deal with disjoint community structures; that is, for any Ca ∈ C and Cb ∈ C we have Ca ∩ Cb = ∅, ∀a ≠ b. However, for real networks in nature and society the situation Ca ∩ Cb ≠ ∅ also arises; this kind of partition is called an overlapping community structure [45]. The modularity score, see Eq. (2), and the NMI score, see Eq. (6), are both defined for non-overlapping community structures, so the algorithms proposed here are not ready for overlapping community detection. We plan to handle these issues in our future work. A more fundamental direction of future work concerns the community quality function. Since modularity suffers from many problems, new metrics for quantifying community structures should be developed; for example, the modularity density [23] is a rather good attempt, although it suffers from its own problems [24]. Our method can be adapted to work with such a new metric: we may again start from singleton communities, store the metric increments caused by community merging in the head-heap, use different sampling strategies to decide which pair should be merged step by step until all nodes are merged into a single group, and finally cut the dendrogram to obtain the optimal partition.
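As a concrete illustration of the head-heap idea summarized above, the following sketch (our own illustration, not the authors' implementation) maintains a max-heap of candidate merges plus a head array of the d entries with the largest increments, from which the next merge is drawn at random:

```python
import heapq
import random

class HeadHeap:
    """Illustrative head-heap sketch: a max-heap of (increment, pair)
    entries plus a head array holding the d entries with the largest
    increments, from which one pair is sampled for the next merge."""

    def __init__(self, entries, d=10):
        self.d = d
        # heapq is a min-heap; negate increments to get max-heap order.
        self.heap = [(-dq, pair) for dq, pair in entries]
        heapq.heapify(self.heap)

    def pop_sampled(self, rng=random):
        """Fill the head array with the top-d entries, sample one
        uniformly (the 'random' strategy), and push the rest back."""
        head = [heapq.heappop(self.heap)
                for _ in range(min(self.d, len(self.heap)))]
        chosen = head.pop(rng.randrange(len(head)))
        for entry in head:          # return unchosen entries to the heap
            heapq.heappush(self.heap, entry)
        neg_dq, pair = chosen
        return -neg_dq, pair

hh = HeadHeap([(0.05, ("a", "b")), (0.03, ("c", "d")), (0.01, ("e", "f"))], d=2)
dq, pair = hh.pop_sampled()
print(dq, pair)   # one of the two largest increments
```

Re-inserting the unchosen head entries costs O(d log n) per merge, which is why a modest d (such as d = 10) keeps the overall running time close to that of the plain max-heap approach.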
Since the proposed algorithms are capable of generating many near-optimal modularity partitions, another interesting line of future work could be ensembles of these partitions [46]. They can be used to find robust communities, i.e. sets of nodes that remain co-clustered across many partitions, and also to gauge confidence in the optimal solution based on the quality of nearby solutions.

Acknowledgments

This work is supported by the Natural Science Foundation of China under grant 61005003. The authors thank A. Clauset and A.P. Francisco for providing public download service of their community detection codes.

References

[1] R. Albert, A.L. Barabási, Statistical mechanics of complex networks, Reviews of Modern Physics 74 (1) (2002) 47–97.
[2] M.E.J. Newman, The structure and function of complex networks, SIAM Review 45 (2) (2003) 167–256.
[3] S. Boccaletti, V. Latora, Y. Moreno, M. Chavez, D.U. Hwang, Complex networks: structure and dynamics, Physics Reports 424 (4–5) (2006) 175–308.
[4] L.F. Costa, O.N. Oliveira Jr., G. Travieso, F.A. Rodrigues, P.R.V. Boas, L. Antiqueira, M.P. Viana, L.E.C. Da Rocha, Analyzing and modeling real-world phenomena with complex networks: a survey of applications, arXiv preprint http://arxiv.org/abs/0711.3199.
[5] A. Arenas, A. Diaz-Guilera, C.J. Pérez-Vicente, Synchronization reveals topological scales in complex networks, Physical Review Letters 96 (11) (2006) 114102.
[6] M. Girvan, M.E.J. Newman, Community structure in social and biological networks, Proceedings of the National Academy of Sciences of the United States of America 99 (12) (2002) 7821.
[7] Y. Pan, D.H. Li, J.G. Liu, J.Z. Liang, Detecting community structure in complex networks via node similarity, Physica A: Statistical Mechanics and its Applications 389 (14) (2010) 2849–2857.
[8] M. Karsai, M. Kivelä, R.K. Pan, K. Kaski, J. Kertész, A.L. Barabási, J. Saramäki, Small but slow world: how network topology and burstiness slow down spreading, Physical Review E 83 (2) (2011) 025102.
[9] S. Fortunato, Community detection in graphs, Physics Reports 486 (3–5) (2010) 75–174.
[10] B.H. Good, Y.A. De Montjoye, A. Clauset, Performance of modularity maximization in practical contexts, Physical Review E 81 (4) (2010) 046106.
[11] B.W. Kernighan, S. Lin, An efficient heuristic procedure for partitioning graphs, Bell System Technical Journal 49 (2) (1970) 291–307.
[12] M. Abramowitz, I.A. Stegun, Handbook of Mathematical Functions, National Bureau of Standards.
[13] U. Brandes, D. Delling, M. Gaertler, R. Gorke, M. Hoefer, Z. Nikoloski, D. Wagner, On modularity clustering, IEEE Transactions on Knowledge and Data Engineering 20 (2) (2008) 172–188.
[14] M.E.J. Newman, Fast algorithm for detecting community structure in networks, Physical Review E 69 (6) (2004) 066133.
[15] M.E.J. Newman, Finding community structure in networks using the eigenvectors of matrices, Physical Review E 74 (3) (2006) 036104.
[16] D. Chen, Y. Fu, M. Shang, A fast and efficient heuristic algorithm for detecting community structures in complex networks, Physica A: Statistical Mechanics and its Applications 388 (13) (2009) 2741–2749.
[17] A. Clauset, M.E.J. Newman, C. Moore, Finding community structure in very large networks, Physical Review E 70 (6) (2004) 066111.
[18] A.P. Francisco, A.L. Oliveira, Improved algorithm and data structures for modularity analysis of large networks, in: NIPS Workshop on Analyzing Graphs, 2008.
[19] S. Fortunato, M. Barthelemy, Resolution limit in community detection, Proceedings of the National Academy of Sciences of the United States of America 104 (1) (2007) 36–41.
[20] A. Lancichinetti, S. Fortunato, Community detection algorithms: a comparative analysis, Physical Review E 80 (5) (2009) 056117.
[21] J. Kumpula, J. Saramäki, K. Kaski, J. Kertesz, Limited resolution in complex network community detection with Potts model approach, The European Physical Journal B - Condensed Matter and Complex Systems 56 (1) (2007) 41–45.
[22] L. Branting, Information theoretic criteria for community detection, Advances in Social Network Mining and Analysis (2010) 114–130.
[23] Z. Li, S. Zhang, R. Wang, X. Zhang, L. Chen, Quantitative function for community detection, Physical Review E 77 (3) (2008) 036109.
[24] X. Zhang, R. Wang, Y. Wang, J. Wang, Y. Qiu, L. Wang, L. Chen, Modularity optimization in community detection of complex networks, EPL (Europhysics Letters) 87 (2009) 38002.
[25] E.A. Leicht, M.E.J. Newman, Community structure in directed networks, Physical Review Letters 100 (11) (2008) 118703.
[26] M.E.J. Newman, Analysis of weighted networks, Physical Review E 70 (5) (2004) 056131.
[27] N.A. Alves, Unveiling community structures in weighted networks, Physical Review E 76 (3) (2007) 036101.
[28] M. Molloy, B. Reed, A critical point for random graphs with a given degree sequence, Random Structures & Algorithms 6 (2–3) (1995) 161–180.
[29] T.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein, Introduction to Algorithms, third ed., MIT Press, 2009.
[30] A. Lancichinetti, S. Fortunato, F. Radicchi, Benchmark graphs for testing community detection algorithms, Physical Review E 78 (4) (2008) 046110.
[31] M.E.J. Newman, M. Girvan, Finding and evaluating community structure in networks, Physical Review E 69 (2) (2004) 026113.
[32] L. Danon, A. Diaz-Guilera, J. Duch, A. Arenas, Comparing community structure identification, Journal of Statistical Mechanics: Theory and Experiment 2005 (2005) P09008.
[33] C.E. Shannon, A mathematical theory of communication, The Bell System Technical Journal 27 (1948) 379–423, 623–656.
[34] W.W. Zachary, An information flow model for conflict and fission in small groups, Journal of Anthropological Research 33 (4) (1977) 452–473.
[35] U. Raghavan, R. Albert, S. Kumara, Near linear time algorithm to detect community structures in large-scale networks, Physical Review E 76 (3) (2007) 036106.
[36] M.E.J. Newman, Modularity and community structure in networks, Proceedings of the National Academy of Sciences 103 (23) (2006) 8577–8582.
[37] D.J. Watts, S.H. Strogatz, Collective dynamics of 'small-world' networks, Nature 393 (6684) (1998) 440–442.
[38] M.E.J. Newman, The structure of scientific collaboration networks, Proceedings of the National Academy of Sciences of the United States of America 98 (2) (2001) 404–409.
[39] M. Boguñá, R. Pastor-Satorras, A. Díaz-Guilera, A. Arenas, Models of social networks based on social distance attachment, Physical Review E 70 (5) (2004) 056122.
[40] R. Albert, H. Jeong, A.L. Barabási, Internet: diameter of the world-wide web, Nature 401 (6749) (1999) 130–131.
[41] M.E.J. Newman, Scientific collaboration networks. I. Network construction and fundamental results, Physical Review E 64 (1) (2001) 016131.
[42] M.E.J. Newman, Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality, Physical Review E 64 (1) (2001) 016132.
[43] C. Andrieu, J. Thoms, A tutorial on adaptive MCMC, Statistics and Computing 18 (4) (2008) 343–373.
[44] S. Kirkpatrick, C.D. Gelatt Jr., M.P. Vecchi, Optimization by simulated annealing, Science 220 (4598) (1983) 671–680.
[45] G. Palla, I. Derényi, I. Farkas, T. Vicsek, Uncovering the overlapping community structure of complex networks in nature and society, Nature 435 (7043) (2005) 814–818.
[46] G. Duggal, S. Navlakha, M. Girvan, C. Kingsford, Uncovering many views of biological networks using ensembles of near-optimal partitions, in: Proceedings of MultiClust: 1st International Workshop on Discovering, Summarizing and Using Multiple Clusterings, KDD, ACM, 2010.