Physica A 522 (2019) 69–79
Overlapping community detection based on conductance optimization in large-scale networks
Yang Gao ∗, Hongli Zhang, Yue Zhang
School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
highlights
• We present a fast and accurate overlapping community detection algorithm.
• We present a seeding method, which selects disjoint clusters as seeds.
• We refine communities by node movements in nearly linear time.
• We propose a novel and precise community combining strategy.
• We propose a strategy that can select communities for outliers efficiently.
article info
Article history: Received 3 May 2018; received in revised form 24 September 2018; available online 2 February 2019
Keywords: Community detection; Conductance optimization; Community combining; Node movements
abstract
Community structure reveals useful information in the domains of sociology, biology, physics and computer science. In this work, we propose an overlapping community detection algorithm for large-scale networks based on local expansion, together with a novel seeding method. We optimize the conductance of communities by: (1) correcting inaccurate community affiliations through node movements; (2) combining densely overlapping communities with a novel combining function; (3) finding communities for the outliers with our proposed theorem. Experimental results on synthetic networks show that the optimization largely enhances community accuracy. Experimental results on large real-world networks show that our approach is superior to state-of-the-art methods. © 2019 Elsevier B.V. All rights reserved.
∗ Corresponding author. E-mail address: [email protected] (Y. Gao).
https://doi.org/10.1016/j.physa.2019.01.142
0378-4371/© 2019 Elsevier B.V. All rights reserved.
1. Introduction
Analyzing community structure in complex networks is an important research area, since it helps to reveal useful hidden information. A community is defined as a dense cluster: a set of nodes with more links among them and fewer links to the rest of the network [1]. Communities overlap when nodes belong to multiple communities.
Early algorithms for locating overlapping communities are often computationally expensive, and many of them can hardly analyze community structure in networks with $10^5$ nodes [2–4]. Experimental results in Ref. [5] show that MMSB [4] cannot analyze community structure in networks with $10^4$ nodes in $10^4$ s, and that LC [2] and CPM [3] run extremely slowly on networks with more than $10^4$ nodes. Recently, many global and local overlapping community detection methods aimed at large networks have been proposed, including Bigclam [5], Demon [6], Oslom [7] and NISE [8]; the experimental results in Section 5.2 demonstrate that they are inefficient in large-scale networks as well.
In this work, we propose a novel method for uncovering overlapping community structure named LECM (Local Expansion and Conductance Minimizing), which achieves high efficiency and accuracy compared to other proposals. In particular, LECM is able to locate communities in networks with $10^5$ nodes in seconds with much higher accuracy compared to other local expansion methods. The main contributions of this work are as follows.
• We present an accurate local overlapping community detection algorithm, which is faster by orders of magnitude than the current state-of-the-art solutions.
• We present a novel seeding method, which selects disjoint clusters that compose the whole network as seeds.
• We propose theorems that are able to measure conductance variation of communities owing to node movements (inserting nodes to communities or removing nodes out of communities) in nearly linear time.
• We propose a novel and precise community combining strategy that merges communities based on similarity and conductance optimization.
• We propose a strategy that selects communities for the outliers. We prove that the method not only improves the final clustering quality in terms of conductance but also balances the number of outliers against the clustering quality.
2. Related work
A large amount of work has been done on finding overlapping communities in complex networks. We introduce a few popular ideas from the literature to show where our approach improves on them.
2.1. Global methods
Early methods for identifying communities mainly focus on the global structure of networks, and most of them are computationally expensive for large-scale networks. The clique percolation method (CPM) [3] finds communities by searching for unions of k-cliques. The method can detect overlapping communities [9] since a node can belong to several k-cliques simultaneously. Link Partitioning [2] deals with the edges of the network instead of the nodes. Communities overlap when edges linked to a node belong to several communities. Non-negative Matrix Factorization (NMF) [5,10–12] is a machine learning technique that has been used in community detection. NMF factorizes a feature matrix $V$ into two non-negative matrices as $V \approx WH$, where the elements of the normalized $W$ denote the dependence of vertices on the communities [13].
2.2. Local methods
Some research shifted to the local structure of networks when mining communities in large-scale networks. Community detection methods based on local expansion start by seeking core vertices, named seeds, from which communities are expanded by optimizing a fitness function. LFM [14] expands a community from a random vertex that has not been visited, until all vertices in the network have been visited. The quality of the communities identified by LFM depends significantly on the parameter in its fitness function. OCA [15] expands a community by seeking the largest increase in the value of its fitness function when removing or inserting a vertex.
NISE removes vertices that are not closely connected to the rest of the network at the start of the algorithm, and establishes several sophisticated seeding strategies. NISE uses personalized PageRank vectors [16] to grow communities, to which the removed nodes are finally attached.
3. Definitions
Definition 1. $G = (V, E)$ is defined as an undirected and unweighted network, where $V$ is the set of vertices and $E$ is the set of edges.
Definition 2. $l(C_i, C_j)$ is defined as the number of edges between nodes in cluster $C_i$ and nodes in cluster $C_j$.
Definition 3. $cut(C_i)$ is defined as the number of links between nodes in $C_i$ and nodes in its complement $\bar{C_i} = V \setminus C_i$, that is, $cut(C_i) = l(C_i, \bar{C_i})$ [8]. We denote $cut(C_i)$ by $c(C_i)$ in this paper.
Definition 4. The conductance of a cluster $C_i$ is defined as follows [16]:
$$\mathrm{cond}(C_i) = \frac{c(C_i)}{\min\left(l(C_i, V),\, l(\bar{C_i}, V)\right)}. \tag{1}$$
$$\mathrm{cond}(C_i) = \frac{c(C_i)}{l(C_i, V)} = \frac{c(C_i)}{2e(C_i) + c(C_i)}, \tag{2}$$
if we assume that $l(C_i, V) \le l(\bar{C_i}, V)$ for any $C_i$, where $e(C_i)$ is the number of edges with both endpoints in $C_i$.
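As an illustration of Definitions 3 and 4, the following minimal sketch computes $c(C)$, $e(C)$ and the conductance of Eq. (1) on a small graph. It assumes the network is stored as a dictionary of adjacency sets, reads $l(C, V)$ as the degree sum $2e(C) + c(C)$ as in Eq. (2), and returns 0.0 when the denominator vanishes; the function names and the toy graph are illustrative and not part of the original work.

```python
from typing import Dict, Hashable, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]  # undirected, unweighted adjacency sets


def cut_and_internal(graph: Graph, community: Set[Hashable]) -> Tuple[int, int]:
    """Return (c(C), e(C)): edges leaving C and edges with both endpoints in C."""
    cut = 0
    twice_internal = 0
    for u in community:
        for w in graph[u]:
            if w in community:
                twice_internal += 1  # every internal edge is seen from both endpoints
            else:
                cut += 1
    return cut, twice_internal // 2


def conductance(graph: Graph, community: Set[Hashable]) -> float:
    """Eq. (1), reading l(C, V) as the degree sum 2e(C) + c(C) as in Eq. (2)."""
    cut, internal = cut_and_internal(graph, community)
    vol = 2 * internal + cut
    total = sum(len(nbrs) for nbrs in graph.values())  # degree sum of the whole graph
    denom = min(vol, total - vol)
    # Assumption: return 0.0 when the denominator vanishes (e.g. C = V).
    return cut / denom if denom > 0 else 0.0


# Hypothetical toy graph: two triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(cut_and_internal(toy, {0, 1, 2}))  # (1, 3)
print(conductance(toy, {0, 1, 2}))       # 1/7
```

For the triangle $\{0, 1, 2\}$ there is one boundary edge and three internal edges, so the conductance is $1/(2 \cdot 3 + 1) = 1/7$.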
Fig. 1. The structure of LECM.
4. Method
LECM finds and refines overlapping communities by minimizing the conductance of communities. Specifically, LECM selects disjoint clusters as seeds and expands communities with the personalized PageRank vectors of Ref. [16]. LECM then refines communities by conductance optimization in three ways: node movements, combining communities and selecting communities for outliers. The structure of the method is illustrated in Fig. 1.
4.1. Seeding
Seeds are selected by various indexes in the literature, among which degree centrality clearly reveals vertex influence [17]. The influence-based seeding phase is presented in Algorithm 1. First, all nodes in the network are sorted in decreasing order of degree. We select the first node in the sequence as the core of a seed; the core together with its neighbors that are still in the sequence composes a seed, and all nodes in the seed are then removed from the sequence. Other seeds are selected in the same way until the sequence is empty. Nodes with large degree thus select members for their seeds first, and seeds do not overlap with each other.
Expanding a community from the neighborhood of a core node was proposed in Ref. [18], in which the authors theoretically demonstrate that such a method may outperform expansion from a single node. However, the method has a serious drawback: the neighborhoods can overlap with each other, and dense overlap between two neighborhoods may cause their expansions to be the same community. Our algorithm addresses this weakness. In addition, the method introduces no parameters. Since the seeds together compose the whole network, the communities expanded from them cover a large fraction of the network.
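The seeding procedure described above can be sketched in a few lines. This is a minimal illustration rather than the authors' C++ implementation: it assumes an adjacency-set dictionary with integer node ids, and it breaks degree ties by node id, which the text does not specify.

```python
def select_seeds(graph):
    """Degree-ordered, non-overlapping seeds: a core plus its unclaimed neighbors."""
    # Assumption: ties in degree are broken by node id to keep the result deterministic.
    order = sorted(graph, key=lambda u: (-len(graph[u]), u))
    remaining = set(graph)          # nodes still in the sequence
    seeds = []
    for core in order:
        if core not in remaining:   # already claimed by an earlier seed
            continue
        seed = {core} | (graph[core] & remaining)
        seeds.append(seed)
        remaining -= seed
    return seeds


# Hypothetical toy graph: hubs 0 and 4 linked through node 3.
toy = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0, 4}, 4: {3, 5, 6}, 5: {4}, 6: {4}}
seeds = select_seeds(toy)
print(seeds)  # [{0, 1, 2, 3}, {4, 5, 6}]
```

Node 3 is a neighbor of both hubs, but it is claimed by the first seed and therefore excluded from the second, so the seeds are disjoint and together cover every node.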
Algorithm 1: Seeding
Data: network G(V, E)
Result: seed set S
1 Initialize S = ∅;
2 Sort nodes in V in decreasing order of degree to obtain the sequence l;
3 Select the first node in l and its neighbors in l as a seed, insert it into S and remove all nodes in the seed from l;
4 if l ≠ ∅, go to 3, else return S.
4.2. Seed expansion
After obtaining the seeds, the PageRank-Nibble algorithm of Ref. [16], which computes approximate PageRank vectors, is used to expand the initial partition of the network. The error of the PageRank vectors computed by PageRank-Nibble can be made less than ε with time complexity O(1/ε) [19]. After the PageRank vector for a seed is obtained, each node u around the seed gets a PageRank score $p_u$, which measures the proximity between the node and the seed. The nodes are then sorted by $p_u / d(u)$ in decreasing order, where $d(u)$ denotes the degree of node u. The set of the first k nodes in the sequence that achieves the best conductance is the community for the seed. The details of the algorithm can be found in Ref. [16]. The approach consisting of the above phases (seeding and seed expansion) is named NLEA (Naive Local Expansion Algorithm).
4.3. Node movements
In this section, we optimize the conductance of the communities located by NLEA via node movements. The movements of a vertex are defined as follows.
• Move out: remove a node from communities to which it currently belongs.
• Move in: insert a node to communities to which it does not belong.
First, we present some theorems that help to reduce the time complexity of the movements, where $\mathrm{cond}_I(v, C)$ denotes the decrease in conductance of a community $C$ when a node $v$ is inserted, and $\mathrm{cond}_O(v, C)$ denotes the decrease when a node $v$ is removed.
Theorem 1. Let $P = \{C_1, \ldots, C_n, \{v\}\}$ be a partition of a network $G(V, E)$, in which the node $v$ can be a member of any community $C_i$ except $C_1$. A new community partition $P_1 = \{C_1', \ldots, C_n\}$ is obtained if the node $v$ is inserted into the community $C_1$, where $C_1' = C_1 \cup \{v\}$. Then,
$$\mathrm{cond}_I(v, C_1) = \mathrm{cond}(C_1) - \mathrm{cond}(C_1') = \frac{\dfrac{c(C_1)}{c(C_1) + 2e(C_1)}\, d_v - d_v + 2 l_{v,C_1}}{c(C_1) + d_v + 2e(C_1)} \tag{3}$$
where $d_v$ denotes the degree of node $v$ and $l_{v,C_1}$ denotes the number of edges between node $v$ and the nodes in $C_1$.
Theorem 2. Let $P = \{C_1, \ldots, C_n\}$ be a community partition of a network $G(V, E)$ where $C_1 = C_1' \cup \{v\}$. A new community partition $P_1 = \{C_1', \ldots, C_n, \{v\}\}$ is obtained if the node $v$ is removed from community $C_1$. Then,
$$\mathrm{cond}_O(v, C_1) = \mathrm{cond}(C_1) - \mathrm{cond}(C_1') = -\mathrm{cond}_I(v, C_1') = \frac{d_v - 2 l_{v,C_1'} - \dfrac{c(C_1) + 2 l_{v,C_1'} - d_v}{c(C_1) + 2e(C_1) - d_v}\, d_v}{c(C_1) + 2e(C_1)} \tag{4}$$
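Eqs. (3) and (4) can be checked numerically against a direct recomputation of the conductance in the form of Eq. (2). The sketch below is only a sanity check on a hypothetical toy graph; the function names are illustrative, and the removal formula assumes the community keeps at least one edge endpoint after $v$ leaves (otherwise its denominator is zero).

```python
import math


def cut_and_internal(graph, community):
    """Count c(C) and e(C) directly from adjacency sets."""
    cut = sum(1 for u in community for w in graph[u] if w not in community)
    inside = sum(1 for u in community for w in graph[u] if w in community)
    return cut, inside // 2


def gain_insert(c, e, d_v, l_v):
    """Eq. (3): cond(C) - cond(C + v), given c(C), e(C), d_v and l_{v,C}."""
    vol = c + 2 * e
    return (c / vol * d_v - d_v + 2 * l_v) / (vol + d_v)


def gain_remove(c, e, d_v, l_v):
    """Eq. (4): cond(C) - cond(C - v); l_v counts edges from v to the rest of C."""
    vol = c + 2 * e
    return (d_v - 2 * l_v - (c + 2 * l_v - d_v) / (vol - d_v) * d_v) / vol


def cond2(c, e):
    """Conductance in the form of Eq. (2)."""
    return c / (c + 2 * e)


# Hypothetical toy graph: two triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
c_s, e_s = cut_and_internal(toy, {0, 1, 2})       # community without node 3
c_l, e_l = cut_and_internal(toy, {0, 1, 2, 3})    # community with node 3
d_v, l_v = len(toy[3]), 1                         # node 3 has one edge into {0, 1, 2}

insert_gain = gain_insert(c_s, e_s, d_v, l_v)
remove_gain = gain_remove(c_l, e_l, d_v, l_v)
direct_insert = cond2(c_s, e_s) - cond2(c_l, e_l)
print(insert_gain, direct_insert)   # both negative: inserting node 3 raises conductance
print(remove_gain, -direct_insert)  # removal is the exact opposite of insertion
```

Only $l_{v,C}$ has to be counted per candidate node; $c(C)$ and $e(C)$ are carried along and updated after each accepted move.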
From the two theorems, only the number of edges between node v and the nodes in community C needs to be computed to measure the decrease in conductance for the two types of movements, since e(C) and c(C) can be updated in O(1) time.
The node movement phase proceeds as follows. First, we remove small communities with fewer than three members, since such communities tend to be invalid in real networks, in accordance with the findings of Ref. [20]. Second, for any community Ci, we measure the decrease in conductance for removing each node that is currently in the community. A node is marked Move Out if the value is positive, and we update c(Ci) and e(Ci) for the removal. Then, we go through all nodes that are not in Ci but are directly connected to nodes in Ci. We mark such a node Move In if the decrease in conductance upon insertion is positive, and we update c(Ci) and e(Ci) for the insertion. Finally, we refine Ci by moving the marked nodes. Discarding small communities and removing nodes from communities may result in homeless nodes; we find communities for them in Section 4.5. A preliminary version of the above work appeared in our previous paper [21].
4.4. Combining communities
After the node movement phase, we obtain a raw network partition, which is a collection of communities and outliers. As many selected seeds are linked to each other, the communities expanded from them are likely to share substantial substructure. Two communities need to be merged if they overlap densely and their combination has better conductance. To this end, we define a combining function of two communities as follows.
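$$CS(C_i, C_j) = \theta \cdot \frac{|C_i \cap C_j|}{|C_i \cup C_j|} + (1 - \theta) \cdot f\left(\mathrm{cond}(C_i), \mathrm{cond}(C_j), \mathrm{cond}(C_i \cup C_j)\right) \tag{5}$$
And we define $f(\mathrm{cond}(C_i), \mathrm{cond}(C_j), \mathrm{cond}(C_i \cup C_j))$ as:
$$\frac{\mathrm{cond}(C_i) + \mathrm{cond}(C_j)}{2 \cdot \mathrm{cond}(C_i \cup C_j) + \mathrm{cond}(C_i) + \mathrm{cond}(C_j)} \tag{6}$$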
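A small worked instance of Eqs. (5) and (6) may help. The sketch below takes the three conductance values as inputs; the function names are illustrative, the conductance values in the example are made up, and treating the degenerate case where all three conductances are zero as $f = 1$ is an assumption.

```python
def merge_gain(cond_i, cond_j, cond_union):
    """Eq. (6): close to 1 when the union has much lower conductance than its parts."""
    total = cond_i + cond_j
    denom = 2 * cond_union + total
    # Assumption: if all three conductances are zero, treat the ratio 0/0 as 1.0.
    return total / denom if denom > 0 else 1.0


def combining_score(ci, cj, cond_i, cond_j, cond_union, theta=0.5):
    """Eq. (5): weighted sum of the Jaccard coefficient and the conductance term."""
    jaccard = len(ci & cj) / len(ci | cj)
    return theta * jaccard + (1 - theta) * merge_gain(cond_i, cond_j, cond_union)


# Hypothetical communities and conductance values, chosen only for illustration.
ci, cj = {1, 2, 3, 4}, {3, 4, 5}
score = combining_score(ci, cj, cond_i=0.4, cond_j=0.6, cond_union=0.25)
print(score)  # 0.5 * (2/5) + 0.5 * (1.0 / 1.5), about 0.533
```

Here the Jaccard term is $2/5$ and $f = 1.0 / (0.5 + 1.0) = 2/3$, so with $\theta = 0.5$ the score is about 0.533.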
Algorithm 2: Combining communities
Data: the partition P = {C1, . . . , Cn} identified previously
Result: a combined community partition
1 Initialize done = False
2 while !done
3   done = True;
4   for each community Ci ⊂ P
5     for each community Cj ⊂ com_neighbor(Ci)
6       if CS(Ci, Cj) ≥ β
7         combine Ci and Cj, done = False;
8     end for
9   end for
10 end while
The first part of $CS(C_i, C_j)$ is the widely adopted Jaccard coefficient for binary sets [22]. In comparison with it, the proposed combining function includes the fraction of similarity of the two communities as well as the fraction of conductance change due to the merging, which is crucial to get better community quality through merging. On the basis of Eqs. (7) and (8), Eq. (6) is a standardized measure that varies from 0 to 1.
$$2 \cdot \mathrm{cond}(C_i \cup C_j) \gg \mathrm{cond}(C_i) + \mathrm{cond}(C_j) \;\Rightarrow\; f\left(\mathrm{cond}(C_i), \mathrm{cond}(C_j), \mathrm{cond}(C_i \cup C_j)\right) \to 0 \tag{7}$$
$$\mathrm{cond}(C_i \cup C_j) = 0 \;\Rightarrow\; f\left(\mathrm{cond}(C_i), \mathrm{cond}(C_j), \mathrm{cond}(C_i \cup C_j)\right) = 1 \tag{8}$$
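The merging loop of Algorithm 2 can be sketched as follows. This is a simplified illustration under several assumptions: overlapping pairs are found by a quadratic scan rather than a com_neighbor index, the scan restarts after every merge, and conductance is recomputed from scratch with Eq. (1). The default values θ = 0.5 and β = 0.8 are the ones used in the experiments; all function names are illustrative.

```python
from itertools import combinations


def make_graph(edges):
    """Build undirected adjacency sets from an edge list."""
    graph = {}
    for u, w in edges:
        graph.setdefault(u, set()).add(w)
        graph.setdefault(w, set()).add(u)
    return graph


def conductance(graph, comm):
    """Eq. (1) with l(C, V) taken as the degree sum of C."""
    cut = sum(1 for u in comm for w in graph[u] if w not in comm)
    vol = sum(len(graph[u]) for u in comm)
    total = sum(len(nbrs) for nbrs in graph.values())
    denom = min(vol, total - vol)
    return cut / denom if denom else 0.0


def combining_score(graph, ci, cj, theta=0.5):
    """Eq. (5) with f from Eq. (6); a zero denominator in f is treated as f = 1."""
    a, b, u = conductance(graph, ci), conductance(graph, cj), conductance(graph, ci | cj)
    f = (a + b) / (2 * u + a + b) if (2 * u + a + b) > 0 else 1.0
    return theta * len(ci & cj) / len(ci | cj) + (1 - theta) * f


def combine_communities(graph, communities, theta=0.5, beta=0.8):
    """Merge overlapping pairs with CS >= beta until no pair qualifies."""
    comms = [set(c) for c in communities]
    done = False
    while not done:
        done = True
        for i, j in combinations(range(len(comms)), 2):
            overlap = comms[i] & comms[j]
            if overlap and combining_score(graph, comms[i], comms[j], theta) >= beta:
                comms[i] |= comms[j]
                del comms[j]
                done = False
                break  # assumption: restart the scan after each merge
    return comms


# Hypothetical toy graph: an 8-clique and a 9-clique joined by the edge 7-8.
edges = list(combinations(range(8), 2)) + list(combinations(range(8, 17), 2)) + [(7, 8)]
graph = make_graph(edges)
raw = [set(range(0, 7)), set(range(1, 8)), set(range(8, 17))]
merged = combine_communities(graph, raw)
print(merged)  # the two overlapping pieces of the 8-clique are merged
```

In this example the two overlapping communities have Jaccard coefficient 0.75 and their union has much lower conductance than either part, so the score is about 0.82 and they are merged; the third community overlaps with nothing and is left unchanged.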
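In the merging phase, we combine communities $C_i$ and $C_j$ if the inequality $CS(C_i, C_j) \ge \beta$ holds. The procedure of the combining communities phase is shown in Algorithm 2, where com_neighbor($C_i$) is the set of communities that overlap with community $C_i$.
Theorem 3. The average conductance of a community partition tends to decrease after merging by $CS(C_i, C_j)$ if $\beta > 0.75$, when $\theta$ is set to 0.5.
Proof. Suppose that communities $C_i$ and $C_j$ are merged by $CS(C_i, C_j)$, and let $k$ denote the number of communities before merging. Since the Jaccard term is at most 1, $CS(C_i, C_j) \ge \beta > 0.75$ with $\theta = 0.5$ requires $f > 0.5$, which by Eq. (6) gives
$$\frac{\mathrm{cond}(C_i) + \mathrm{cond}(C_j)}{2} > \mathrm{cond}(C_i \cup C_j),$$
then
$$\begin{aligned}
&\frac{1}{k}\sum_{l} \mathrm{cond}(C_l) - \frac{1}{k-1}\Big(\sum_{l \ne i,\, l \ne j} \mathrm{cond}(C_l) + \mathrm{cond}(C_i \cup C_j)\Big) \\
&\quad > \frac{1}{k}\sum_{l} \mathrm{cond}(C_l) - \frac{1}{k-1}\Big(\sum_{l \ne i,\, l \ne j} \mathrm{cond}(C_l) + \frac{\mathrm{cond}(C_i) + \mathrm{cond}(C_j)}{2}\Big) \\
&\quad = \frac{k-2}{k(k-1)}\Big(\frac{\mathrm{cond}(C_i) + \mathrm{cond}(C_j)}{2} - \frac{\sum_{l \ne i,\, l \ne j} \mathrm{cond}(C_l)}{k-2}\Big).
\end{aligned}$$
Suppose that $\frac{\mathrm{cond}(C_i) + \mathrm{cond}(C_j)}{2} \ge \frac{1}{k}\sum_{l} \mathrm{cond}(C_l)$, since the merged communities tend to be of low quality and have conductance no less than the average. Then,
$$\frac{\mathrm{cond}(C_i) + \mathrm{cond}(C_j)}{2} \ge \frac{\sum_{l \ne i,\, l \ne j} \mathrm{cond}(C_l)}{k-2},$$
$$\frac{1}{k}\sum_{l} \mathrm{cond}(C_l) - \frac{1}{k-1}\Big(\sum_{l \ne i,\, l \ne j} \mathrm{cond}(C_l) + \mathrm{cond}(C_i \cup C_j)\Big) > 0. \qquad \square$$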
4.5. Selecting communities for outliers
Algorithm 3: Finding communities for outliers
Data: vertices that belong to no communities and a raw community partition
Result: final communities
1 for each vertex v that belongs to no communities
2   for each community Ci ⊂ com_neighbor(v)
3     if Eq. (11) holds
4       insert v into Ci;
5   end for
6 end for
7 for each vertex v that belongs to no communities
8   if inequality (9) holds
9     insert v into community C by Eq. (10);
10 end for
Now we deal with the outliers. We find communities for them by Algorithm 3, where com_neighbor(v) is the set of communities that are linked to node v. First, we select communities for the outliers by Theorem 4 (lines 1–6). Then we find communities for the nodes that still belong to no communities by the value of the left side of inequality (9), and we use the parameter χ to balance the conductance of the final communities against the number of outliers. A node v leaves the set of outliers if inequality (9) holds, and we insert the node into the single community C given by Eq. (10).
$$\max_{C_i \subset \mathrm{com\_neighbor}(v)} \frac{l_{v,C_i}}{d_v} > \chi. \tag{9}$$
$$C = \mathop{\arg\max}_{C_i :\, C_i \subset \mathrm{com\_neighbor}(v)} \frac{l_{v,C_i}}{d_v}. \tag{10}$$
Theorem 4. When a node $v$, which is not a member of a community $C_i$, is moved into the community, the conductance of $C_i$ will decrease if
$$1 - 2 \cdot \frac{l_{v,C_i}}{d_v} < \frac{\sum_{m \in C_i} d_m - 2e(C_i)}{\sum_{m \in C_i} d_m}. \tag{11}$$
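Proof. When a node $v$ is inserted into a community $C_i$, the new community is $C_i \cup \{v\}$. By the definition of conductance, the conductances of $C_i$ and $C_i \cup \{v\}$ are:
$$\frac{\sum_{m \in C_i} d_m - 2e(C_i)}{\sum_{m \in C_i} d_m} \quad \text{and} \quad \frac{\sum_{m \in C_i} d_m + d_v - 2e(C_i \cup \{v\})}{\sum_{m \in C_i} d_m + d_v}.$$
If
$$1 - 2 \cdot \frac{l_{v,C_i}}{d_v} < \frac{\sum_{m \in C_i} d_m - 2e(C_i)}{\sum_{m \in C_i} d_m},$$
then
$$\frac{d_v - 2 \cdot l_{v,C_i}}{d_v} < \frac{\sum_{m \in C_i} d_m - 2e(C_i)}{\sum_{m \in C_i} d_m},$$
$$\frac{d_v + \sum_{m \in C_i} d_m - \sum_{m \in C_i} d_m - 2 \cdot l_{v,C_i}}{d_v} < \frac{\sum_{m \in C_i} d_m - 2e(C_i)}{\sum_{m \in C_i} d_m},$$
$$\frac{d_v + \sum_{m \in C_i} d_m - 2e(C_i \cup \{v\}) + 2e(C_i \cup \{v\}) - \sum_{m \in C_i} d_m - 2 \cdot l_{v,C_i}}{d_v} < \frac{\sum_{m \in C_i} d_m - 2e(C_i)}{\sum_{m \in C_i} d_m},$$
$$d_v + \sum_{m \in C_i} d_m - 2e(C_i \cup \{v\}) < \Big(\sum_{m \in C_i} d_m - 2e(C_i)\Big) \cdot \frac{d_v}{\sum_{m \in C_i} d_m} - 2e(C_i \cup \{v\}) + \sum_{m \in C_i} d_m + 2 \cdot l_{v,C_i}.$$
For $e(C_i \cup \{v\}) = e(C_i) + l_{v,C_i}$,
$$d_v + \sum_{m \in C_i} d_m - 2e(C_i \cup \{v\}) < \Big(\sum_{m \in C_i} d_m - 2e(C_i)\Big) \cdot \Big(\frac{d_v}{\sum_{m \in C_i} d_m} + 1\Big),$$
$$\frac{d_v + \sum_{m \in C_i} d_m - 2e(C_i \cup \{v\})}{\sum_{m \in C_i} d_m + d_v} < \frac{\sum_{m \in C_i} d_m - 2e(C_i)}{\sum_{m \in C_i} d_m}. \qquad \square$$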
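The two passes of Algorithm 3 can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: outliers are visited in dictionary order, community membership is updated immediately after each insertion, ties in Eq. (10) go to the first community with the maximal value, and the right side of Eq. (11) is computed as cut over degree sum, which equals the expression in the theorem. The default χ = 0.5 is the value used in the experiments; the function name and toy graph are illustrative.

```python
def assign_outliers(graph, communities, chi=0.5):
    """Two-pass outlier assignment: Theorem 4 first, then Eqs. (9) and (10)."""
    comms = [set(c) for c in communities]
    covered = set().union(*comms)
    outliers = [v for v in graph if v not in covered]

    def links(v, comm):
        return sum(1 for w in graph[v] if w in comm)

    def cond2(comm):
        # Right side of Eq. (11): (degree sum - 2e(C)) / degree sum = cut / degree sum.
        vol = sum(len(graph[u]) for u in comm)
        cut = sum(1 for u in comm for w in graph[u] if w not in comm)
        return cut / vol if vol else 0.0

    # Pass 1 (lines 1-6): insert v into every linked community satisfying Eq. (11).
    still_out = []
    for v in outliers:
        d_v, placed = len(graph[v]), False
        for comm in comms:
            l_v = links(v, comm)
            if l_v > 0 and 1 - 2 * l_v / d_v < cond2(comm):
                comm.add(v)
                placed = True
        if not placed:
            still_out.append(v)

    # Pass 2 (lines 7-10): insert v into the community maximizing l/d if Eq. (9) holds.
    remaining = []
    for v in still_out:
        d_v = len(graph[v])
        best = max(comms, key=lambda comm: links(v, comm))
        if d_v > 0 and links(v, best) / d_v > chi:
            best.add(v)
        else:
            remaining.append(v)
    return comms, remaining


# Hypothetical toy graph: 4-cliques {0..3} and {4..7}; node 9 links to 0, 1, 4 and 8;
# node 8 hangs off node 9. Nodes 8 and 9 start as outliers.
toy = {0: {1, 2, 3, 9}, 1: {0, 2, 3, 9}, 2: {0, 1, 3}, 3: {0, 1, 2},
       4: {5, 6, 7, 9}, 5: {4, 6, 7}, 6: {4, 5, 7}, 7: {4, 5, 6},
       8: {9}, 9: {0, 1, 4, 8}}
final, leftover = assign_outliers(toy, [{0, 1, 2, 3}, {4, 5, 6, 7}])
print(final, leftover)  # [{0, 1, 2, 3, 8, 9}, {4, 5, 6, 7}] []
```

In the first pass node 8 has no link to any community and is skipped, while node 9 satisfies Eq. (11) for the first community only. In the second pass node 8 now has its single edge inside the first community, so its ratio is 1 > χ and it is inserted as well.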
Table 1 Information of the synthetic networks.
Networks   Om   µ     On
N1         2    0.1   0–5000
N2         2    0.3   0–5000
N3         4    0.1   0–5000
N4         4    0.3   0–5000
4.6. Analysis on time complexity
We analyze the time complexity of each phase in this section. Let m denote the number of vertices in the network, n the number of edges, k the number of uncovered communities, and neighbor($C_i$) the set of nodes that are not in community $C_i$ but are connected to vertices in $C_i$. In the seeding phase, the sorting operation bounds the execution time, so the time cost is $O(m \log m)$. In the seed expansion phase, the time complexity is $O\big(\sum_{i=1}^{k} l(C_i, V)\big)$ [23]. In the node movements phase, the complexity for marking Move Out is as follows.
$$O\Big(\sum_{i=1}^{k} \sum_{v_j \in C_i} d_{v_j}\Big). \tag{12}$$
The complexity for marking Move In is as follows.
$$O\Big(\sum_{i=1}^{k} \Big(\sum_{v_j \in C_i} d_{v_j} + \sum_{v_l \in \mathrm{neighbor}(C_i)} d_{v_l}\Big)\Big). \tag{13}$$
Thus, the total complexity of the phase is as follows.
$$O\Big(\sum_{i=1}^{k} \Big(2 \cdot \sum_{v_j \in C_i} d_{v_j} + \sum_{v_l \in \mathrm{neighbor}(C_i)} d_{v_l}\Big)\Big), \tag{14}$$
which is nearly equal to $O(n)$. In the phase of combining communities, we have the cost
$$O\Big(\sum_{C_i \cap C_j \ne \emptyset} \; \sum_{v_l \in C_i \cup C_j} d_{v_l}\Big), \tag{15}$$
which is also nearly equal to $O(n)$. In the phase of selecting communities for outliers, we have the cost
$$O\Big(\sum_{v_l \in \mathrm{outliers}} d_{v_l}\, |\mathrm{com\_neighbor}(v_l)|\Big) \ll O(n). \tag{16}$$
Hence, we achieve refined communities by the three phases in nearly linear time.
5. Experimental setup and results
NLEA and LECM are written in C++. All the experiments are carried out on a PC with a 3.30 GHz Intel Core processor and 16 GB RAM.
5.1. Compared with the naive local expansion algorithm
Data sets: We utilize synthetic networks generated by the LFR overlapping benchmark proposed by Lancichinetti and Fortunato [24], which is widely used to evaluate overlapping community detection algorithms. The LFR overlapping benchmark has several user-specified parameters. N denotes the number of vertices in the network; k denotes the average degree of vertices; Cmin and Cmax denote the minimum and maximum community size respectively; kmax denotes the upper bound on the degree of vertices; µ denotes the expected proportion of a vertex's degree that is external to its community. Locating communities in a network gets harder as µ increases. On specifies the number of overlapping vertices and Om is the upper bound on the membership of vertices.
Metrics: We employ the following three criteria to compare LECM with the naive local expansion algorithm (NLEA).
• Average F1 score [5], which quantifies the correspondence between the identified communities and the ground-truth communities. The Average F1 score for two partitions $P_1 = \{C_1, \ldots, C_n\}$ and $P_2 = \{C_1', \ldots, C_n'\}$ of a network is defined as follows.
$$\frac{1}{2}\Big(\frac{1}{|P_1|}\sum_{C_i \in P_1} F1(C_i, C'_{f(i)}) + \frac{1}{|P_2|}\sum_{C_i' \in P_2} F1(C_{f'(i)}, C_i')\Big), \tag{17}$$
in which $F1(C_i, C_j')$ denotes the harmonic mean of the Precision and the Recall, $f(i) = \arg\max_j F1(C_i, C_j')$, and $f'(i) = \arg\max_j F1(C_j, C_i')$.
• Normalized Mutual Information (NMI) [14], which is specially designed to evaluate the quality of overlapping communities. The value of NMI quantifies the similarity between two community partitions of a network, one of which is always the ground-truth community partition.
• We define Average Conductance as:
$$\frac{1}{k} \cdot \sum_{C_i \subset P} \mathrm{cond}(C_i), \tag{18}$$
which is the average conductance of all communities in a partition of a network. The parameter k in Eq. (18) denotes the number of communities, and P denotes a community partition of the network. We quantify conductance improvement with this metric.
Set up: We generate four sets of networks with the same parameters N = 10 000, kmax = 50, k = 15, Cmax = 50, Cmin = 10; the other parameters are summarized in Table 1. Each group includes six networks, in which On ranges from 0 to 0.5N.
Values for the parameters: In the PageRank-Nibble algorithm, we set α = 0.01 [8,25] and $\varepsilon = 10^{-4}$ in all experiments. In the combining communities phase, we set θ = 0.5, since we regard conductance refinement as being as important as similarity in combining, and set β = 0.8, which tends to result in lower average conductance by Theorem 3 and yields the best results in our experiments. We set χ = 0.5 in the phase of selecting communities for outliers.
Community quality: The NMI results on the four sets of networks are illustrated in Fig. 2. LECM achieves higher NMI on all the synthetic networks, which indicates that the refinement phases are valid and stable. When the mixing rate µ increases (from 0.1 to 0.3), the community structure becomes more ambiguous, both NLEA and LECM achieve smaller NMI values, and the gap between the two curves gets larger, which indicates that the refinement phases increase the adaptability of the algorithm to fuzzy networks. The Average F1 scores are illustrated in Fig. 3. Nearly the same conclusions can be drawn. Fig. 4 gives the Average Conductance of the two algorithms. It reveals that the communities found by LECM possess much lower average conductance than those found by NLEA, which indicates that we optimize conductance successfully. The low average conductance corresponds to the high quality of the community structure. We also processed synthetic networks with 20 000 and 50 000 vertices; we omit the results since they vary little from the above results. Thus, we strongly believe that the refinement phases are effective and steady.
Fig. 2. NMI of NLEA and LECM on synthetic networks. Every point in the figures denotes the average of 10 executions on networks generated randomly with parameters in Table 1.
5.2. Compared with the state-of-the-art overlapping community detection algorithms
Data sets: We utilize the networks provided by SNAP [26], which include the ground-truth communities. The information of the networks is summarized in Table 2.
Set up: We compare LECM to three globally based overlapping community detection methods, Bigclam [5], Demon [6] and Oslom [7], and one localized method, NISE [8], with the implementations provided by the authors. All algorithms are set to run in a single thread. In Bigclam, we set the parameter k, which denotes the number of communities, to the number of ground-truth communities. In Oslom, we set the parameter hr, which denotes the number of runs for the higher hierarchical level, to zero. In NISE, we use load_graph in Ref. [18] to create the MATLAB sparse matrix that NISE needs, and use writeSMAT in Ref. [18] to write down the identified communities. We set the parameter k, which controls the number of communities, to the real number of ground-truth communities as well. Sphub, which achieves the best performance in Ref. [8], is selected as the seeding strategy, and both expansion methods ppr and vppr are selected for NISE in our experiments. We have also run NISE on synthetic networks since it is also based on local expansion and employs the same seed expansion method as LECM. We omit the results since the performance of the algorithm is not competitive compared to NLEA and LECM.
Running Time and Community accuracy: Fig. 5 gives the running time of LECM and all baselines on the DBLP and Amazon datasets. LECM is faster by orders of magnitude than the baselines, including the localized algorithm NISE. The results of community accuracy on the DBLP and Amazon datasets are shown in Figs. 6 and 7. On the DBLP dataset, LECM and Oslom obtain the best community structure in terms of F1 score and NMI. LECM exceeds NISE with ppr by 26% and NISE with fppr by 31% in terms of Average F1 score, and the NMI value achieved by LECM is almost double that of NISE with fppr. On the Amazon dataset, LECM obtains a result similar to that of Bigclam in terms of Average F1 score, even though Bigclam was given the number of ground-truth communities a priori, and LECM performs the best in terms of NMI.
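The Average F1 score of Eq. (17), used in the comparisons above, can be sketched as follows. This is a minimal illustration assuming communities are given as Python sets; the function names and the toy partitions are illustrative, and a pair of communities with no common node is assigned an F1 of 0.

```python
def f1(a, b):
    """Harmonic mean of precision and recall between two node sets."""
    inter = len(a & b)
    if inter == 0:
        return 0.0
    precision = inter / len(a)
    recall = inter / len(b)
    return 2 * precision * recall / (precision + recall)


def average_f1(p1, p2):
    """Eq. (17): each community is matched to its best counterpart, in both directions."""
    side1 = sum(max(f1(c, g) for g in p2) for c in p1) / len(p1)
    side2 = sum(max(f1(c, g) for c in p1) for g in p2) / len(p2)
    return 0.5 * (side1 + side2)


# Hypothetical detected and ground-truth partitions.
detected = [{1, 2, 3}, {4, 5}]
truth = [{1, 2, 3, 4}, {5, 6}]
score = average_f1(detected, truth)
print(score)  # (6/7 + 1/2) / 2 in both directions, i.e. 19/28
```

In this example the best matches have F1 values 6/7 and 1/2 from either side, so the Average F1 score is 19/28, about 0.679; identical partitions give 1.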
Fig. 3. Average F1 score of NLEA and LECM on synthetic networks. Every point in the figures denotes the average of 10 executions on networks generated randomly with parameters in Table 1.
Fig. 4. Average Conductance of NLEA and LECM on synthetic networks. Every point in the figures denotes the average of 10 executions on networks generated randomly with parameters in Table 1.
The results on the Youtube and Orkut datasets are presented in Table 3. The sign "–" denotes that the algorithm cannot analyze the community structure of the network within 48 h or fails owing to insufficient memory. On the Youtube dataset, the F1 score of LECM improves on the baselines by nearly 2 times. LECM is faster by orders of magnitude than the baselines as well. On Orkut, LECM is the only algorithm that uncovers the community structure in a timely manner. In this dataset, the top 5000 ground-truth communities, which have the highest quality [27], are adopted as the benchmark. The results demonstrate that LECM is able to analyze community structure on large networks in a timely manner with high accuracy.
Table 2 Information of the real-world networks.
Networks   Vertices    Edges         Communities
DBLP       317,080     1,049,866     13,477
Amazon     334,863     925,872       75,149
Youtube    1,134,890   2,987,624     16,386
Orkut      3,072,441   117,185,083   15,301,901
Fig. 5. Execution time on datasets of DBLP and Amazon.
Fig. 6. The Average F1 score on datasets of DBLP and Amazon.
Fig. 7. NMI on datasets of DBLP and Amazon.
Table 3 Results on datasets of Youtube and Orkut.
Algorithm       Youtube                          Orkut
                Time        F1 score   NMI       Time       F1 score   NMI
LECM            41.39 s     0.130      0.020     221.24 s   0.223      0.067
Bigclam         –           –          –         –          –          –
Demon           14 100 s    0.043      0.024     –          –          –
Oslom           –           –          –         –          –          –
Nise-sph-ppr    5024.63 s   0.036      0.001     –          –          –
Nise-sph-fppr   4556.7 s    0.046      0.001     –          –          –
6. Conclusions
In this work, we proposed a novel seeding method and three community refinement metrics. Experimental results on synthetic networks show that our refinement metrics largely enhance the quality of the identified communities. Experimental results on large real-world networks show that our approach is superior to the state-of-the-art global and local overlapping community detection methods.
Acknowledgments
This work was supported by The National Key Research and Development Program of China [Grant No. 2017YFB0803304]; and The National Key Research and Development Program of China [Grant No. 2016QY03D0501].
References
[1] M. Girvan, M.E.J. Newman, Community structure in social and biological networks, Proc. Natl. Acad. Sci. USA 99 (2002) 7821–7826.
[2] Y.-Y. Ahn, J.P. Bagrow, S. Lehmann, Link communities reveal multiscale complexity in networks, Nature 466 (7307) (2010) 761–764.
[3] G. Palla, I. Derényi, I. Farkas, T. Vicsek, Uncovering the overlapping community structure of complex networks in nature and society, Nature 435 (7043) (2005) 814–818.
[4] E.M. Airoldi, D.M. Blei, S.E. Fienberg, E.P. Xing, Mixed membership stochastic blockmodels, J. Mach. Learn. Res. 9 (5) (2008) 1981–2014.
[5] J. Yang, J. Leskovec, Overlapping community detection at scale: A nonnegative matrix factorization approach, in: Sixth ACM Int. Conf. Web Search Data Min, 2013, pp. 587–596.
[6] M. Coscia, G. Rossetti, F. Giannotti, D. Pedreschi, Demon: a local-first discovery method for overlapping communities, in: KDD ’12 Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012, pp. 615–623.
[7] A. Lancichinetti, F. Radicchi, J.J. Ramasco, S. Fortunato, Finding statistically significant communities in networks, PLoS One 6 (2011) e18961.
[8] J.J. Whang, D.F. Gleich, I.S. Dhillon, Overlapping community detection using neighborhood-inflated seed expansion, IEEE Trans. Knowl. Data Eng. 28 (5) (2016) 1272–1284.
[9] M. Javed, et al., Community detection in networks: A multidisciplinary review, J. Netw. Comput. Appl. 108 (2018) 87–111.
[10] C. Hsieh, I. Dhillon, Fast coordinate descent methods with variable selection for non-negative matrix factorization, in: KDD ’11 Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2011, pp. 1064–1072.
[11] D.D. Lee, H.S. Seung, Learning the parts of objects by non-negative matrix factorization, Nature 401 (6755) (1999) 788–791.
[12] C.-J. Lin, Projected gradient methods for nonnegative matrix factorization, Neural Comput. 19 (10) (2007) 2756–2779.
[13] J. Xie, S. Kelley, B.K. Szymanski, Overlapping community detection in networks: The state-of-the-art and comparative study, ACM Comput. Surv. 45 (4) (2013) 43:1–43:35.
[14] A. Lancichinetti, S. Fortunato, J. Kertész, Detecting the overlapping and hierarchical community structure in complex networks, New J. Phys. 11 (2009) 033015.
[15] A. Padrol-Sureda, G. Perarnau-Llobet, J. Pfeifle, V. Muntés-Mulero, Overlapping community search for social networks, in: Proc. - Int. Conf. Data Eng, 2010, pp. 992–995.
[16] R. Andersen, F. Chung, K. Lang, Local graph partitioning using pagerank vectors, in: Proc. - Annu. IEEE Symp. Found. Comput. Sci. FOCS, 2006, pp. 475–486.
[17] F. Hu, et al., An algorithm J-SC of detecting communities in complex networks, Phys. Lett. A 381 (42) (2017) 3604–3612.
[18] D.F. Gleich, C. Seshadhri, Vertex neighborhoods, low conductance cuts, and good seeds for local community methods, in: KDD ’12 Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012, pp. 597–605.
[19] R. Andersen, K.J. Lang, Communities from seed sets, in: WWW ’06 Proc. 15th Int. Conf. World Wide Web, 2006, pp. 223–232.
[20] S. Fortunato, Community detection in graphs, Phys. Rep. 486 (3–5) (2010) 75–174.
[21] Y. Gao, H. Zhang, Y. Zhang, A fast and high quality approach for overlapping community detection through minimizing conductance, in: 2016 IEEE First Int. Conf. Data Sci. Cybersp, 2016, pp. 688–693.
[22] P. Jaccard, The distribution of flora in the alpine zone, New Phytol. 11 (2) (1912) 37–50.
[23] J.J. Whang, D.F. Gleich, I.S. Dhillon, Overlapping community detection using seed set expansion, in: CIKM ’13 Proceedings of the 22nd ACM International Conference on Inf, 2013, pp. 2099–2108.
[24] A. Lancichinetti, S. Fortunato, Community detection algorithms: A comparative analysis, Phys. Rev. E 80 (5) (2009) 056117.
[25] J. Leskovec, K.J. Lang, A. Dasgupta, M.W. Mahoney, Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters, Internet Math. 6 (1) (2009) 29–123.
[26] Available at: http://snap.stanford.edu.
[27] J. Yang, J. Leskovec, Defining and evaluating network communities based on ground-truth, in: ICDM ’12 Proceedings of the 2012 IEEE 12th International Conference on Data Mining, 2012, pp. 745–754.