Identifying overlapping communities in networks using evolutionary method



Physica A 442 (2016) 182–192

Contents lists available at ScienceDirect

Physica A journal homepage: www.elsevier.com/locate/physa

Identifying overlapping communities in networks using evolutionary method

Weihua Zhan a,b,c,∗, Jihong Guan d, Huahui Chen b, Jun Niu b, Guang Jin b

a Department of Control Science and Engineering, Zhejiang University, Hangzhou 310058, PR China
b Department of Computer Science, Ningbo University, Ningbo 315211, PR China
c State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210046, PR China
d College of Electrical Information and Engineering, Tongji University, Shanghai 201804, PR China

Highlights
• Present an encoding scheme for an overlapping partition of a network.
• Present two informativeness measures for a node.
• Present a coevolutionary schema between two segments over the population.

Article info
Article history: Received 18 January 2015; received in revised form 20 May 2015; available online 16 September 2015
Keywords: Community structure; Evolutionary method; Overlapping communities

Abstract
Community structure is a typical property of real-world networks, and has been recognized as a key to understanding the dynamics of networked systems. In most networks the overwhelming majority of nodes belong to a single community, while a few nodes often straddle several communities. Hence, an ideal algorithm for community detection should be able to identify the overlapping communities in these networks. We present an evolutionary method for detecting overlapping community structure in a network. To represent an overlapping division of a network, we develop an encoding scheme composed of two segments: the first represents a disjoint partition, and the second represents an extension of the partition that allows multiple memberships. We give two measures for the informativeness of a node, and present a coevolutionary scheme between the two segments over the population for finding the overlapping partition of the network. Experimental results show that this method can give a better solution for a network. They also reveal that the best overlapping partition of a network might not derive from the best disjoint partition. © 2015 Elsevier B.V. All rights reserved.

1. Introduction
As a unified tool for studying various complex systems, networks have attracted tremendous attention during the last ten years [1–3], with nodes representing the units and edges denoting diverse interactions between these units. In social networks, edges often capture various social relations between individuals; in technology networks (such as the Internet), an edge may correspond to a physical connection (or communication link) between two sites; in information networks, an edge usually indicates the flow of information between sites.

∗ Corresponding author at: Department of Computer Science, Ningbo University, Ningbo 315211, PR China. E-mail address: [email protected] (W. Zhan).

http://dx.doi.org/10.1016/j.physa.2015.09.031 0378-4371/© 2015 Elsevier B.V. All rights reserved.


Community structure is an important topological feature of real networks. It refers to natural clusters of nodes such that the connections within clusters are significantly denser than those between clusters [4]. Because of its close relation to functional decomposition and to the various dynamics of systems, community structure detection has been extensively studied. A variety of methods for detecting communities have been proposed based on different principles and heuristics, such as the divisive method based on betweenness [4], methods based on modularity optimization by simulated annealing [5], spectral methods [6,7] or extremal optimization [8], methods based on dynamical processes including random walks [9] or synchronization [10,11], methods based on different formal definitions of community [12,13], and methods based on the minimum spanning tree [14].
For all of these methods a common assumption is that community structure is a disjoint division of the network, that is, any node belongs to only one community. However, this may not be the case for many real-world networks. In a scientific collaboration network, for instance, an energetic scientist may participate in several research groups with different concerns. Hence an ideal algorithm for community detection should be able to automatically find an accurate overlapping division of the network if the community structure is indeed overlapping.
Recently there has been growing interest in overlapping community detection [15–19]. A well-known method is the clique percolation method (CPM) [15], where a community is a union of adjacent k-cliques (complete subgraphs with k nodes) in the network. This method has been extensively applied to the analysis of social and biological networks. How to select the best k is a practical problem for CPM, since the divisions found with different k values generally differ from each other.
It is notable that several methods based on extending a disjoint division of a network to an overlapping division have been proposed [20,21]. As with disjoint community detection, the detection of overlapping communities can also be formulated as an optimization problem, given an appropriate measure for the quality of an overlapping division. Nicosia et al. [16] extended the modularity to the overlapping case, and then proposed a genetic algorithm to optimize their quality function. Shen et al. [22] also presented an overlapping measure and then employed Blondel's algorithm to optimize it. Zhang et al. successively presented a fuzzy c-means method [23] and a non-negative matrix factorization method [24] for finding a good overlapping division.
Compared with other heuristics, evolutionary methods have a stronger global search ability and greater stability, both of which stem from their population-based search mechanism. Zhan et al. [25] proposed a modified adaptive genetic algorithm (MAGA), which is superior to standard genetic algorithms when applied to community detection. To deal with the overlapping case, here we propose an evolutionary method for overlapping community detection, MAGA*, as an extension of the MAGA. In Section 2, we give a variation of overlapping modularity for evaluating the quality of an overlapping partition. In Section 3, we describe the evolutionary method for detecting overlapping communities in detail. In Section 4, we test the method on several real-world networks: the karate network, the high school network, and the dolphin network. Finally, the conclusion is given.

2. Quality function for an overlapping partition
To measure the goodness of a partition of a network, Newman and Girvan proposed the modularity [26], which has been widely used as an objective function for optimization-based community detection approaches.
There also exist other measures for the quality of a disjoint partition, such as the Hamiltonian of the Potts model [27] and the absolute Potts model [28], modular density [29], and the surprise value [30,31]. The definition of modularity is based on the idea that the true community structure of a network should correspond to a statistically surprising arrangement of edges, that is, the number of actual links within communities should significantly exceed the number of links expected under a null model. The configuration model, an extensively used null model, is employed in the definition of modularity. Let $L$ be the total number of edges in the network and $k_i$ be the degree of the node $i$; then in the null model the expected number of edges between nodes $i$ and $j$ is $k_i k_j / 2L$. The modularity can thus be written as follows:

$$Q = \frac{1}{2L} \sum_{c \in C} \left( e_c^{in} - e_c^{exp} \right) \tag{1}$$

where $C$ refers to a partition of the network, and $e_c^{in}$ and $e_c^{exp}$ are the number of inner links in the cluster $c$ and the expected number of inner links, which are counted as $\sum_{i,j \in G_c} A_{ij}$ and $\sum_{i,j \in G_c} \frac{k_i k_j}{2L}$, respectively. $Q$ is the sum of the difference over the $|C|$ groups of the specific partition. The maximum value of $Q$ is 1, and a value approaching 1 indicates strong community structure. Conversely, a value approaching 0 implies weaker community structure or indivisibility. For a network with strong community structure, $Q$ normally falls in the range from around 0.3 to 0.7. Since the above definition of modularity is designed for simple networks, some variations have been presented for various types of network [32–35]. Evaluating the quality of an overlapping partition of the network requires redefining the number of inner links and the expected number of links in a cluster. The number of inner links in the cluster $c$ can be counted as



$$e_c^{in} = \sum_{i,j \in G_c} S_{ic} S_{jc} A_{ij}, \tag{2}$$


and the expectation number of inner links reads

$$e_c^{exp} = \frac{\left( \sum_{i,j \in G_c} S_{ic} S_{jc} A_{ij} + \sum_{i \in G_c,\, j \notin G_c} S_{ic} (1 - S_{jc}) A_{ij} \right)^2}{2L}. \tag{3}$$
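The quality function of Eqs. (1)–(3) can be sketched in a few lines of Python. This is only an illustrative reading of the formulas, not the authors' implementation: the function name `overlapping_modularity`, the dictionary-based graph representation, and the convention that a node outside a cluster has belonging coefficient 0 for that cluster are all assumptions made here.

```python
def overlapping_modularity(adj, communities, S):
    """Q_ov = (1/2L) * sum_c (e_in_c - e_exp_c), read off Eqs. (1)-(3).

    adj          -- dict: node -> set of neighbours (undirected simple graph)
    communities  -- list of sets of nodes (the sets may overlap)
    S            -- dict: (node, community index) -> belonging coefficient
    """
    two_L = sum(len(nbrs) for nbrs in adj.values())  # 2L = sum of degrees
    q = 0.0
    for c, members in enumerate(communities):
        # Eq. (2): links with both ends inside the cluster, summed over
        # ordered pairs (i, j), so each inner link is counted twice.
        e_in = sum(S[i, c] * S[j, c]
                   for i in members for j in adj[i] if j in members)
        # Second term in the numerator of Eq. (3): links leaving the cluster.
        # ASSUMPTION: S_jc = 0 for a node j that is not in cluster c.
        e_out = sum(S[i, c] * (1.0 - S.get((j, c), 0.0))
                    for i in members for j in adj[i] if j not in members)
        e_exp = (e_in + e_out) ** 2 / two_L
        q += e_in - e_exp
    return q / two_L


# Hypothetical toy graph: two triangles joined by a single bridge (2-3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = {v: set() for v in range(6)}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

disjoint = [{0, 1, 2}, {3, 4, 5}]
S_crisp = {(i, c): 1.0 for c, members in enumerate(disjoint) for i in members}
q_disjoint = overlapping_modularity(adj, disjoint, S_crisp)
```

With crisp (0 or 1) coefficients the bracket in Eq. (3) is simply the total degree of the cluster, so on this toy graph the sketch returns the ordinary modularity of the two-triangle split, 2 × (3/7 − (7/14)²) = 5/14.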

It can be verified that, when imposing the constraint that a node can belong to only one community, i.e., $S_{ic}$ and $S_{jc}$ take the value 0 or 1, the overlapping measure recovers the modularity. There also exist alternative definitions of overlapping modularity. In the overlapping modularity presented by Zhang et al. [23], the belonging coefficient of an edge i−j is specified by the average of the belonging coefficients of nodes i and j. In the overlapping modularity presented by Shen et al. [22], the second term in the numerator is omitted.

3. Evolutionary method for overlapping community detection
As a class of general-purpose tools for solving hard problems, evolutionary algorithms have been widely used in many areas of scientific research and engineering. Several evolutionary methods have been proposed for community detection [25,36–38]. In these algorithms a chromosome encodes a partition of the network of interest, with an associated fitness value evaluating the quality of the partition. The encoding scheme is one of the key elements for success in applying evolutionary algorithms. There exist two types of encoding schemes for a disjoint partition of the network:

• Matrix-based encoding [36]. In this scheme a chromosome can be represented by a matrix $M = \{\alpha_{i,c}\}$, where i is a node in the network and c represents a community in a partition of the network. This matrix is normally called the assignment matrix because the element $\alpha_{i,c}$ indicates whether the node i is a member of the community c or not (1 means yes, 0 means no).
• Locus-based adjacency encoding. This representation was proposed in Ref. [39] for the clustering problem, and has been employed in Refs. [25,40] for community detection. In this encoding scheme, a chromosome consists of N loci, one locus for each node in the network, and the allele at a locus j is the label of one neighbor of the node j in the network.

To deal with the case of overlapping communities, we first consider the representation of an overlapping partition of the network.

3.1. Encoding scheme
Let the number of clusters in the partition be c and the size of the network (i.e., the number of nodes in the network) be n. A simple encoding scheme is the extension of the matrix-based encoding, in which a chromosome consists of n × c loci and $\alpha_{i,c}$ indicates the belonging coefficient of the node i with respect to the cluster c [16]. This representation requires c to be no less than the number of true communities of the network, and if the number of clusters significantly exceeds the true value it results in a high cost in both time and space. Moreover, with this encoding scheme the evolutionary method needs an extra repair operator after performing genetic operators, as the latter may produce illegal individuals. Pizzuti [40] presented an indirect representation of an overlapping partition, wherein the network is translated into a line graph and a partition of the line graph corresponds to an overlapping partition of the original network. The partition of the line graph is encoded by the locus-based adjacency representation, which prevents the production of illegal individuals.
However, a node with degree k in the original network produces a k-clique in the line graph, since a node in the line graph corresponds to an edge in the original network and a link between two nodes of the line graph stands for two edges that share an endpoint in the original network. Hence, the line graph is always larger than the original one in size and has more links, which increases the complexity of the problem. Here, we present a coding scheme, LARO (locus-based adjacency representation for overlapping communities), which does not require transforming the network into its line graph. This representation is based on the following ideas:

• A disjoint partition can be extracted from an overlapping partition. In an overlapping partition there are a few overlapping nodes, while most of the nodes are non-overlapping. By assigning each overlapping node to a single community, a disjoint partition can be obtained. As shown in Fig. 1, the overlapping community structure of the network is π* = {{1, 2, 3, 4}, {4, 5, 6, 7, 8}, {8, 9, 10, 11, 12}}. From this overlapping partition one can obtain a disjoint one, π = {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}}.
• Conversely, an overlapping partition can be induced from a disjoint partition. Take the network above as an example again. If the disjoint partition π of the network has been obtained, then the overlapping community structure can be obtained by assigning the nodes 4 and 8 to more than one community. Indeed, some recent work on identifying overlapping communities is essentially based on the presumption that a good overlapping partition of a network can be extended from a good disjoint one by determining a few overlapping nodes [20,21].


Fig. 1. LARO encoding on an example network. (a) Example network and its overlapping communities, π* = {{1, 2, 3, 4}, {4, 5, 6, 7, 8}, {8, 9, 10, 11, 12}}; nodes 4 and 8 are overlapping nodes. (b) Sets of nodes connected by solid links form a disjoint partition, which contains three primary communities; dashed lines 4–7 and 8–9 express the overlapping relations among the clusters. (c) LARO encoding scheme. The first column gives the labels of nodes, and the second one represents the disjoint partition. Gray cells encode overlapping information and take values 0 or 1, with cell (i, j) indicating whether the node i adheres to its (j−2)th neighbor.
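To make the two-segment idea of Fig. 1(c) concrete, the sketch below decodes a LARO-style chromosome into a disjoint (primary) partition and an overlapping (extended) one. It is a minimal illustration under stated assumptions: the function name `decode_laro`, the dictionary layout of the two segments, and the 7-node example graph are hypothetical and are not the network of Fig. 1.

```python
def decode_laro(neighbors, P, O):
    """Decode a two-segment chromosome into (primary, extended) partitions.

    neighbors -- dict: node -> ordered list of its neighbours
    P         -- dict: node -> the neighbour it adheres to (primary segment)
    O         -- dict: node -> list of 0/1 flags, one per neighbour
    """
    # Primary partition: connected components of the links i -- P[i].
    parent = {v: v for v in neighbors}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for i, j in P.items():
        parent[find(i)] = find(j)

    primary = {}
    for v in neighbors:
        primary.setdefault(find(v), set()).add(v)

    # Extended partition: node i also joins the primary community of every
    # neighbour whose flag in O[i] equals 1.
    extended = {root: set(members) for root, members in primary.items()}
    for i, flags in O.items():
        for nbr, flag in zip(neighbors[i], flags):
            if flag:
                extended[find(nbr)].add(i)

    return (sorted(primary.values(), key=min),
            sorted(extended.values(), key=min))


# Hypothetical example: node 4 sits between {1, 2, 3} and {5, 6, 7}.
neighbors = {1: [2, 3], 2: [1, 3, 4], 3: [1, 2, 4], 4: [2, 3, 5, 6],
             5: [4, 6, 7], 6: [4, 5, 7], 7: [5, 6]}
P = {1: 2, 2: 3, 3: 1, 4: 3, 5: 6, 6: 7, 7: 5}
O = {1: [1, 0], 2: [0, 1, 0], 3: [1, 0, 0], 4: [0, 1, 1, 0],
     5: [0, 1, 0], 6: [0, 0, 1], 7: [1, 0]}
primary, extended = decode_laro(neighbors, P, O)
```

In this example node 4 adheres to node 3 in the primary segment and additionally to node 5 in the overlap segment, so the primary partition is {1, 2, 3, 4}, {5, 6, 7} while the extended partition is {1, 2, 3, 4}, {4, 5, 6, 7}. Because every allele is a real neighbor or a 0/1 flag, any chromosome decodes to a legal partition without a repair step.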

For convenience, a primary partition, π, refers to a disjoint partition, and an extended partition of π is an overlapping partition extended from the primary partition π, denoted by π*. To represent an overlapping partition, a chromosome consists of two segments. The first segment represents a primary partition, denoted by P. Following the locus-based adjacency representation, it consists of n loci, each of which corresponds to a node, and the allele indicates the neighbor node to which the node adheres. The second segment, denoted by O, represents the overlapping information between multiple communities. As in the locus-based adjacency representation, the node i adhering to the node j implies that i is a member of the community of the primary partition to which the node j belongs. A node with degree k owns k loci in this segment, whose alleles take values of 0 or 1, indicating whether the node adheres to the corresponding neighbor. For the node i, the locus in the first segment is denoted by P(i), the loci in the second segment are denoted by O(i), and O(i, k) denotes the kth locus of O(i), which corresponds to the kth neighbor of the node i. As shown in Fig. 1(b), apart from the node 3 in the same primary community, the node 4 adheres to the node 7 (the node 4's fifth neighbor), which lives in another primary community. In this way, the primary community {5, 6, 7, 8} becomes an extended community {4, 5, 6, 7, 8}. The node 8 likewise adheres to the node 9 outside its primary community, which turns the primary community {9, 10, 11, 12} into an extended community {8, 9, 10, 11, 12}.

3.2. Informativeness measures for nodes
In standard genetic algorithms mutation is performed uniformly on all loci. A novel kind of genetic algorithm performs informative mutation, i.e., each locus mutates with a probability proportional to its informativeness [41,42]. A proper measure for the informativeness of loci is crucial to this paradigm of evolutionary algorithms.
The allele standard deviation [41,42] works well in the optimization of continuous variables, but it often causes problems related to the allele values in discrete optimization. Instead, in the MAGA [25] the informativeness of a locus is measured by the bias between the actual distribution of alleles over the current population and the random distribution, namely the Kullback–Leibler divergence between these two distributions. Let the distribution of alleles at the locus i be $P_i$ and the random one be $Q_i$; then the informativeness of locus i is

$$\mu_i = \mathrm{KLD}(P_i \,\|\, Q_i) = \sum_x P(X_i = x) \log\left( |X_i| \cdot P(X_i = x) \right) \tag{4}$$

where $|X_i|$ is the number of alleles at the locus i, which is the number of neighbors of the node i in the network when applied to community detection. For the encoding scheme above for overlapping partitions, we introduce two measures for the informativeness of nodes over the population. The first measure is the primary informativeness of node i, which refers to the Kullback–Leibler divergence of the locus of the node in the segment P (denoted by $\mu_i^p$). It indicates the informativeness of the primary partitions with respect to the node i over the current population.
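As a small illustration of Eq. (4), the following sketch computes the informativeness of a single locus from the alleles it carries across a population. The function name `locus_informativeness` and its argument layout are hypothetical; weighting each chromosome by its normalized fitness and using the natural logarithm are assumptions made for this sketch.

```python
import math
from collections import defaultdict


def locus_informativeness(alleles, fitness, n_alleles):
    """KL divergence between the fitness-weighted allele distribution at one
    locus and the uniform distribution over n_alleles values (Eq. (4))."""
    total = sum(fitness)
    prob = defaultdict(float)
    for allele, f in zip(alleles, fitness):
        prob[allele] += f / total
    # Alleles that never occur contribute nothing (0 * log 0 is taken as 0).
    return sum(p * math.log(n_alleles * p) for p in prob.values() if p > 0)


# A node with four neighbours (2, 3, 5, 6), observed over four chromosomes.
mu_random = locus_informativeness([2, 3, 5, 6], [1, 1, 1, 1], 4)
mu_converged = locus_informativeness([3, 3, 3, 3], [1, 2, 3, 4], 4)
```

A locus whose alleles are spread evenly over all neighbors carries no information (value 0), while a locus on which the whole population agrees reaches the maximum, log |X_i|.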


In contrast, the second one is the overall informativeness of a node, which reflects the overall information of the node encoded by the two segments of chromosomes over the population (denoted by $\mu_i^o$). Notice that the segment O actually implies the primary partition as well. Hence, the overall informativeness can be defined on the segment O alone over the population. For the node i, this measure is given by

$$\mu_i^o = \sqrt[d]{\mu_{i,1} \cdot \mu_{i,2} \cdots \mu_{i,d}} \tag{5}$$

where d is the degree of the node i, and $\mu_{i,1}, \mu_{i,2}, \ldots, \mu_{i,d}$ are the d biases of the loci of node i in O(i).

3.3. Mutation and reassignment technology
Mutation is a primary genetic operator in this algorithm. The two segments of a chromosome undergo mutation in different ways: a locus from segment P changes its allele by randomly selecting a neighbor of the node, while the mutation of a locus from segment O reduces to a switch between 0 and 1. Since the information about a node's membership in a primary partition appears in both segments simultaneously, the mutation of a locus in the first segment must be kept consistent with the second one. For instance, if the allele of locus P(i) is changed from $j_1$ (the $k_1$th neighbor) to $j_2$ (the $k_2$th neighbor), then the segment O is altered at the same time, with O(i, $k_1$) set to 0 and O(i, $k_2$) set to 1 if necessary.¹ The reassignment technology, which is designed for differentiating the informative state of a locus from its initial random state [25], is also necessary in the MAGA*. Under the LARO encoding scheme, the loci in segment P perform the same reassignment action as in the MAGA, while for the segment O the reassignment requires a slightly more complex operation. Consider a chromosome r in which the node i is an overlapping node and adheres to the node k in the primary partition. Let the node j be the first neighbor that belongs to the same primary community as the node i. When we calculate the distribution of the locus P(i), the contribution from r, $f(r) / \sum_{r'} f(r')$, is reassigned to j, not k. We can also write the bias on the hth ($1 \le h \le d_i$) locus O(i, h), $\mu_{i,h}$, as

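$$\mu_{i,h} = p_{i,h} \log\left( 2 \cdot p_{i,h} \right) + (1 - p_{i,h}) \log\left( 2 \cdot (1 - p_{i,h}) \right), \tag{6}$$

$$p_{i,h} = \sum_{r} \left( \frac{f(r)}{\sum_{r'} f(r')} \sum_{c \in C_i^r} \delta_{i,c} \cdot \delta_{h,L(i,c)} \right), \tag{7}$$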

where $C_i^r$ is the set of memberships of the node i in the partition encoded by the chromosome r, and L(i, c) is the first neighbor of i that belongs to the community c.

3.4. Coevolution of primary partition and extended partition
As mentioned above, the overlapping community structure of the network can be considered as an extended partition of a disjoint partition; the disjoint and extended partitions are encoded by the primary and extended segments of a chromosome, respectively. The extended partition is obviously entangled with the primary partition. How should the two segments of a population be evolved to effectively find a good overlapping partition of a network? A simple strategy is a two-step schema: first try to find the best primary partition, and then attempt to find the best extension of the identified primary partition. In spirit this has been used in Ref. [20], where neither of the two steps employs evolutionary methods. The two-step strategy is possibly the simplest way to evolve the two segments of the population, and it is easy to implement as well; nevertheless, there is no guarantee that for every network the best overlapping partition can be extended from the best disjoint partition of the network. Instead of this strategy, in the MAGA* we develop a flexible coevolution schema for the population. The algorithm begins with a stochastic initialization of the population. After that, it evaluates the fitness values of all individuals and the informativeness of each locus in both segments. The algorithm then sequentially scans the population and selects each chromosome with a certain probability. If a chromosome is selected, it performs information-guided mutation; otherwise it undergoes random mutation. The information-guided mutation follows the rule that loci with weak informativeness mutate while loci with strong informativeness remain unchanged.
Hence the information-guided mutation on a chromosome is performed as follows: select $\alpha_r \times N$ nodes to mutate according to the rank of the overall informativeness of all nodes; if the node i is picked, then determine which segment of the node i has to be mutated. If the informativeness of P(i) is stronger than the overall informativeness, then for this node the primary partitions over the population are (or are close to) a good disjoint partition, while the overlapping partitions over the population are relatively random. In that case the mutation is performed on the extended loci O(i); otherwise it is performed on the primary locus P(i) of the node. If the mutation is to be performed on O(i), then one locus is randomly selected for mutation among those whose corresponding neighbor is not a member of the primary community of the node i.
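The overall informativeness of Eqs. (5) and (6) and the segment choice described above can be sketched as follows. The names `binary_bias`, `overall_informativeness` and `plan_mutation` are hypothetical, the natural logarithm is assumed, and picking the nodes with the lowest overall informativeness is one reading of "according to the rank" that follows the rule that weakly informative loci mutate.

```python
import math


def binary_bias(p):
    """Eq. (6): KL divergence of a 0/1 locus with P(1) = p from a fair coin."""
    value = sum(q * math.log(2 * q) for q in (p, 1 - p) if q > 0)
    return max(0.0, value)  # guard against tiny negative rounding errors


def overall_informativeness(ps):
    """Eq. (5): geometric mean of the biases of the d overlap loci of a node."""
    biases = [binary_bias(p) for p in ps]
    return math.prod(biases) ** (1 / len(biases))


def plan_mutation(mu_primary, mu_overall, alpha_r):
    """Choose alpha_r * N nodes and the segment ('P' or 'O') to mutate.

    ASSUMPTION: the chosen nodes are those with the weakest overall
    informativeness.
    """
    n_mutate = round(alpha_r * len(mu_overall))
    weakest = sorted(mu_overall, key=mu_overall.get)[:n_mutate]
    plan = {}
    for i in weakest:
        # Primary locus already well determined: explore the overlap loci O(i).
        # Otherwise mutate the primary locus P(i).
        plan[i] = "O" if mu_primary[i] > mu_overall[i] else "P"
    return plan


mu_full = overall_informativeness([1.0, 1.0])   # all loci fully determined
mu_zero = overall_informativeness([1.0, 0.5])   # one locus still a coin flip
plan = plan_mutation({1: 0.9, 2: 0.2, 3: 0.6, 4: 0.2},
                     {1: 0.8, 2: 0.5, 3: 0.1, 4: 0.3}, alpha_r=0.5)
```

Because Eq. (5) is a geometric mean, a single undecided overlap locus (p = 0.5) drives the overall informativeness of the node to zero, which makes such nodes the first candidates for mutation.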

¹ There is no need to change if, before mutation, O(i, $k_2$) = 1, i.e., the node i is also a member of the community that the node $j_2$ lives in.


The overall procedure of the MAGA* can be described as follows:
(1) The connectivity of the network of interest is fed into the MAGA*. The algorithm then creates NP initial feasible solutions,² and calculates the fitness of all chromosomes.
(2) At each generation, the MAGA* first copies the fittest 10% of the chromosomes of the previous generation into the current generation.
(3) The MAGA* then reproduces 0.9NP individuals (NP being the size of the population) by selecting them from the previous generation in proportion to their fitness, to prepare for mutation.
(4) The fitness and the cumulative fitness probability of the chromosomes are evaluated; immediately afterwards, the primary informativeness and the overall informativeness of all nodes are evaluated, and the nodes are then ranked according to their overall informativeness.
(5) The individuals reproduced in step 3 are scanned, and each chromosome r is selected with the probability 1 − C(f(r)); if the chromosome is chosen, the mutation described above is performed; otherwise a local search for fitter individuals is performed.
(6) Steps 2–5 are repeated until a certain termination criterion has been met. The MAGA* then outputs the partition with the highest fitness.
Fig. 2 illustrates the flowchart of the MAGA*.

4. Experimental results

4.1. Zachary Club Network
We applied the evolutionary method for overlapping community detection to several real networks. We first considered the Zachary Club Network [43]. Nicosia et al. [16] fixed the number of communities to be 2; Fig. 3(a) shows the best partition in terms of their modularity, where nodes 3 and 10 are identified as overlapping nodes. In the suboptimal solution, nodes 9, 31 and 34 are identified as overlapping nodes in addition to nodes 3 and 10.
When the upper bound on the number of communities is set to 10, six communities are empty, while two overlapping communities that overlap at nodes 3 and 10 correspond to the classical partition, i.e., {1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 17, 18, 20, 22} and {3, 9, 10, 15, 16, 19, 21, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34}. The remainder are the subcommunities identified by Duch et al. [44], as shown in Fig. 3(b). Shen et al. [22] gave an alternative overlapping modularity and presented a clique-based method; they found a partition with three overlapping communities (as shown in Fig. 3(c)) when fixing k = 3. When fixing k = 4, the method identifies four communities, as shown in Fig. 3(d). Fig. 3(e) shows the result obtained with Zhang et al.'s [24] overlapping modularity and the c-means method, where the overlapping nodes are {1, 9, 10, 31}. Fig. 3(f) shows the result of the non-negative matrix factorization method: the network is partitioned into two overlapping communities with overlapping nodes {3, 9, 20}. Evidently, different approaches give different partitions and different overlapping nodes, although this network is very simple. As one can see, nodes 1, 3 and 9 are overlapping nodes. Only Fig. 3(c) identifies all of these nodes at once; Fig. 3(a) and (b) identify only node 3, Fig. 3(e) fails to identify node 3, and Fig. 3(f) fails to identify overlapping node 1. Nodes 10, 20 and 31, which lie at the boundary, lean towards one of the communities, though it is to some extent reasonable to consider them overlapping nodes. Fig. 4 shows the best solution obtained by our method when applied to this network with λ0 = 0.3.³ The attained value of the modularity is $Q_{ov}^* = 0.4564$, with the network partitioned into four overlapping communities, π* = {{1, 2, 3, 4, 8, 9, 12, 13, 14, 18, 20, 22}, {1, 5, 6, 7, 11, 17}, {3, 9, 10, 15, 16, 19, 21, 23, 24, 27, 28, 29, 30, 31, 33, 34}, {24, 25, 26, 28, 29, 32}}.
Apart from nodes 1, 3 and 9, the overlapping nodes include 24, 28 and 29. The primary partition of this optimal solution is exactly the known best disjoint partition of the network, i.e., π = {C1, C2, C3, C4}, C1 = {1, 2, 3, 4, 8, 12, 13, 14, 18, 20, 22}, C2 = {5, 6, 7, 11, 17}, C3 = {9, 10, 15, 16, 19, 21, 23, 27, 30, 31, 33, 34}, C4 = {24, 25, 26, 28, 29, 32}, for which the modularity is 0.4198. Intuitively, it is preferable that communities C3 and C4 appear as two communities rather than being merged into one group. On the other hand, nodes 24, 28 and 29 belong to C4 while they are indeed densely connected to C3. Thus, it is reasonable to view them as overlapping nodes of the two primary communities.

4.2. High school network
To compare the MAGA* with state-of-the-art methods for overlapping community detection, including CPM, COPRA [45], Game [46], NMF [47], OSLOM [48], SLPA [49], and GCE [50], we examined these algorithms on a high school friendship network whose true partition is known. This network is constructed from the self-reports of students, and the true partition is based on their grades. Fig. 5 shows the real communities corresponding to the grades of the students

² Each locus in the primary segment is initialized with a random allele. If node i adheres to its jth neighbor in a chromosome, then the locus O(i, j) is set to 1, and the other loci of the node in the overlapping segment are initialized with 0 or 1, each with equal probability (0.5).
³ For any node i and community c, if $S_{ic} \ge \lambda_0$ then the node i is recognized as a member of the community c. $S_{ic}$ can be calculated as the number of ends that the node i has in the community c relative to the degree of i.
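The membership rule of footnote 3 can be illustrated with a short sketch. It is one possible reading only: the function names are hypothetical, $S_{ic}$ is taken as the fraction of a node's links that end in community c, and a node is assumed to keep its primary community even when that fraction falls below the threshold.

```python
def belonging_coefficients(neighbors, primary):
    """S[i, c] = (number of neighbours of i inside community c) / degree(i).

    Assumes every node has at least one neighbour.
    """
    S = {}
    for i, nbrs in neighbors.items():
        for c, members in enumerate(primary):
            S[i, c] = sum(1 for j in nbrs if j in members) / len(nbrs)
    return S


def extend_partition(neighbors, primary, lam0=0.3):
    """Add node i to community c whenever S[i, c] >= lam0.

    ASSUMPTION: primary members always stay in their own community.
    """
    S = belonging_coefficients(neighbors, primary)
    return [{i for i in neighbors if i in members or S[i, c] >= lam0}
            for c, members in enumerate(primary)]


# Hypothetical graph: two 4-cliques, with node 4 also linked to nodes 5 and 6.
neighbors = {1: [2, 3, 4], 2: [1, 3, 4], 3: [1, 2, 4], 4: [1, 2, 3, 5, 6],
             5: [4, 6, 7, 8], 6: [4, 5, 7, 8], 7: [5, 6, 8], 8: [5, 6, 7]}
primary = [{1, 2, 3, 4}, {5, 6, 7, 8}]
S = belonging_coefficients(neighbors, primary)
extended = extend_partition(neighbors, primary, lam0=0.3)
```

Here node 4 has two of its five links in the second clique (coefficient 0.4), so it becomes an overlapping node at λ0 = 0.3, whereas nodes 5 and 6 have only one of four links in the first clique (0.25) and stay in a single community.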


Fig. 2. Flowchart of the MAGA*. s is a uniform random number between [0, 1], which is used to simulate the probability 1 − C (f (r )) for the individual r to perform the informative mutation.
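The generational loop summarized in the flowchart can be sketched as a generic skeleton. This is not the authors' code: the function `evolve`, its parameters, and the toy operators are hypothetical stand-ins, the informativeness computations of step 4 are left to the supplied mutation operator, and the cumulative probability C(f(r)) is assumed here to be the share of total fitness held by individuals no fitter than r. Fitness values must be strictly positive for the roulette-wheel selection used below.

```python
import random


def evolve(init, fitness, informed_mutation, fallback,
           n_pop=20, generations=30, elite_frac=0.1, seed=0):
    """Generational loop loosely following steps (1)-(6) of the procedure."""
    rng = random.Random(seed)
    population = [init(rng) for _ in range(n_pop)]                # step 1
    n_elite = max(1, round(elite_frac * n_pop))
    best_history = []
    for _ in range(generations):                                  # step 6
        scores = [fitness(r) for r in population]
        best_history.append(max(scores))
        order = sorted(range(n_pop), key=scores.__getitem__, reverse=True)
        elite = [population[k] for k in order[:n_elite]]          # step 2
        parents = rng.choices(population, weights=scores,
                              k=n_pop - n_elite)                  # step 3
        total = sum(scores)

        def cumulative(f):                                        # step 4
            # ASSUMPTION: C(f) = share of the total fitness held by
            # individuals that are no fitter than f.
            return sum(s for s in scores if s <= f) / total

        children = []
        for r in parents:                                         # step 5
            if rng.random() < 1.0 - cumulative(fitness(r)):
                children.append(informed_mutation(r, rng))
            else:
                children.append(fallback(r, rng))
        population = elite + children
    return max(population, key=fitness), best_history


# Toy stand-ins (hypothetical): maximise the number of ones in a bit tuple.
def toy_init(rng):
    return tuple(rng.randint(0, 1) for _ in range(12))


def toy_fitness(bits):
    return 1 + sum(bits)  # kept strictly positive for roulette selection


def toy_flip(bits, rng):
    k = rng.randrange(len(bits))
    return bits[:k] + (1 - bits[k],) + bits[k + 1:]


def toy_keep(bits, rng):
    return bits


best, history = evolve(toy_init, toy_fitness, toy_flip, toy_keep)
```

Copying the fittest individuals unchanged into the next generation means the best fitness recorded in `history` can never decrease, which is the practical effect of step (2). In an actual run, `fitness` would be the overlapping modularity of the decoded chromosome (shifted to be positive) and the two operators would act on the P and O segments.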

Table 1
Results on a high school network.

Algorithm   Num. of communities   Overlapping nodes    NMI
CPM         2                     12, 18               0.1679
COPRA       6                     Total 14             0.7966
Game        10                    Total 14             0.4673
GCE         6                     0, 21, 45, 46, 61    0.8333
OSLOM       11                    45, 46               0.4315
Link        20                    Total 31             0.3155
NMF         7                     0, 12, 18, 45        0.643
SLPA        6                     1, 42, 45, 69        0.6718
MAGA*       6                     31, 42, 45, 61       0.8149

ranging from 7 to 12. Even though there are no overlapping communities according to the reports of the students, each algorithm identified some overlap based on the social connections, as shown in Table 1. For an algorithm that identified more than 10 overlapping nodes, only the total number is listed. The MAGA*, GCE, OSLOM and SLPA gave the true number of communities. It is notable that the MAGA* and GCE produced partitions closer to the ground truth in terms of the NMI.⁴ Fig. 5 shows the partition identified by the MAGA* in detail. The algorithm exactly recovered four grades (grades 7–10), and there is a subtle difference between the community V and grade 11. The main distinction is between the community VI, {1, 31, 61, 63, 67, 68}, and grade 12 ({1, 67, 68, 69}). However, the students of grade 12 seem insufficient to form a single community by themselves from the view

⁴ Apart from the optimal solution ($Q_{ov}^* = 0.6251$) obtained by the MAGA*, we also obtained a sub-optimal solution ($Q_{ov} = 0.6246$) with a higher NMI, 0.8273. The difference between the solutions is that the student 45 is assigned only to his grade community (grade 8) in the sub-optimal solution.


Fig. 3. Communities are identified by the colors of nodes, and the nodes enveloped by two dashed lines are overlapping nodes. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

of topological structure. For instance, the associations among them are weak, while both 61 and 63 have stronger associations with the group. It seems that the partition in which the group {1, 31, 61, 63, 67, 68} constitutes a community is a better solution. Let us focus on the algorithms that identified fewer than ten overlapping nodes. CPM found a partition which is far from the known partition in terms of NMI. In that partition nodes 12 and 18 are identified as overlapping nodes. All the other algorithms identified node 45 as overlapping. The overlapping nodes identified by the MAGA* are 31, 42, 45 and 61. Community VI overlaps with community III at node 61 and with community V at node 31. Nodes 42 and 45 are the overlapping nodes between communities II and III. SLPA identified the set of overlapping nodes closest to that of the MAGA*, {1, 42, 45, 59}, but it has a lower NMI relative to the true partition.


Fig. 4. Overlapping communities of Zachary Club Network identified by MAGA*. Nodes in the same shades indicate a community of the network. The nodes circled by two dashed lines represent overlapping nodes.

Fig. 5. High school network and the overlapping community structure. Colors indicate the known communities corresponding to grades ranging from 7 to 12. Dashed lines indicate the partition identified using the MAGA*. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

4.3. Dolphin network

The third real network is the dolphin network, constructed from observations of the association patterns of 62 bottlenose dolphins living in Doubtful Sound between 1994 and 2001 [51,52]. Nodes in the network represent dolphins, and a link indicates frequent association between a pair of dolphins. We first examined the disjoint partition of this network using MAGA. The best solution partitions the network into five communities, with modularity 0.5285. As this is a well-known animal social network, it is interesting to explore whether it exhibits overlapping communities. Fig. 6 shows the best solution for this network using our method, with λ0 = 0.3. It partitions the network into four overlapping communities with the highest modularity, 0.5422, which exceeds the modularity of the best disjoint partition into five groups. Clearly, the overlap is significant for this network. Knit and DN63 are the overlapping nodes between the box and circle communities. Zipfel and Thumper are the overlapping nodes between the circle and diamond communities, while TR99 and CCL are the overlapping nodes between the diamond and triangle communities. We can observe that some nodes between three overlapping communities, such as SN100, Double and Kringel


Fig. 6. Overlapping community structure in Dolphins network. Shapes indicate primary partition, red nodes indicate overlapping nodes. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

shared by three communities. A disjoint partition of a network can be viewed as a special overlapping partition of the network; it is thus a better solution to the division of the network only if it has a greater value of the overlapping modularity. Furthermore, at each generation the population allows disjoint partitions and overlapping partitions to coexist. Hence, the procedure of MAGA* is a coevolution of disjoint partitions and overlapping partitions. This method is superior to approaches that first find a best disjoint partition and then extend it to an overlapping partition. As we can see for the dolphin network, the best partition is an overlapping partition, and it is remarkable that this overlapping partition is not an extension of the best disjoint partition identified by MAGA.

5. Conclusion

Community structure is pervasive in real networks, and overlap between communities is ubiquitous in many of them. The structural overlap of the modules of a system has implications for the interaction between the functions of the modules, and overlapping community detection facilitates a precise analysis of the dynamics of systems. Approaches for overlapping community detection will therefore find many applications. In this paper, we presented an evolutionary method for detecting overlapping communities. We developed an effective encoding scheme for overlapping communities and introduced two measures for the informativeness of nodes. We then presented a coevolutionary scheme to find a high-quality solution for the network. Experimental results on several real-world networks show that the method can give a better overlapping partition. Moreover, they reveal that this method can find a best partition that is probably missed by methods that obtain an overlapping partition by locally extending a disjoint partition. We have also given a variation of the overlapping modularity.
Experiments in this study indicate that the method can yield a better partition by optimizing this measure. Of course, like the modularity and other versions of overlapping modularity, it could suffer from the resolution limit [53], especially when applied to a large heterogeneous network. This limit could be mitigated by some existing techniques [54,55]. Moreover, although our method is based on the optimization of the overlapping modularity, the fitness function can conveniently be replaced with a new quality function for the overlapping partition. When the method is applied to a large network, its efficiency needs to be considered. Combining the method with a multi-level technique [56,57] appears to be a good way to tackle this problem.
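Two of the points above, that a disjoint partition is a special case of an overlapping one and that the fitness function is replaceable, can be illustrated with a minimal sketch. It rests on several assumptions: the quality function is one commonly used extension of modularity in which each node's contribution is divided by the number of communities O_i it belongs to, which is not necessarily the variation proposed in this paper; the names `overlapping_modularity` and `best_cover` are hypothetical; and the graph is a toy example of two triangles sharing one node.

```python
def overlapping_modularity(edges, cover):
    """Modularity with each node's weight split over its O_i communities.

    Assumed form (not necessarily the paper's variant):
    Q_ov = 1/(2m) * sum_c sum_{i,j in c} (A_ij - k_i*k_j/(2m)) / (O_i*O_j)
    For a disjoint partition every O_i is 1 and this is the usual modularity.
    """
    m = len(edges)
    adjacency = set()
    degree = {}
    for u, v in edges:
        adjacency.add((u, v))
        adjacency.add((v, u))
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1

    # O_i: number of communities that contain node i.
    membership = {}
    for community in cover:
        for node in community:
            membership[node] = membership.get(node, 0) + 1

    total = 0.0
    for community in cover:
        for i in community:
            for j in community:
                a_ij = 1.0 if (i, j) in adjacency else 0.0
                expected = degree[i] * degree[j] / (2.0 * m)
                total += (a_ij - expected) / (membership[i] * membership[j])
    return total / (2.0 * m)


def best_cover(edges, candidates, fitness=overlapping_modularity):
    """Return the candidate with the highest fitness; `fitness` is pluggable."""
    return max(candidates, key=lambda cover: fitness(edges, cover))


# Toy graph (an assumption for illustration): two triangles sharing node 2.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (2, 4), (3, 4)]
disjoint = [{0, 1, 2}, {3, 4}]       # node 2 assigned to one triangle only
overlapping = [{0, 1, 2}, {2, 3, 4}]  # node 2 shared by both triangles

print(overlapping_modularity(edges, disjoint))
print(overlapping_modularity(edges, overlapping))
print(best_cover(edges, [disjoint, overlapping]))
```

In this toy case the overlapping cover scores 1/6 against 1/9 for the disjoint one, so both kinds of candidate can be ranked by the same function. Passing a different callable as `fitness` changes the quality function without touching the selection step.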


Acknowledgments

This research was supported by the Natural Science Foundation of Zhejiang (LY13F020038), the Natural Science Foundation of Ningbo (2013A610112), and the K.C. Wong Magna Fund in Ningbo University.

References

[1] R. Albert, A.-L. Barabási, Rev. Modern Phys. 74 (2002) 47.
[2] S.N. Dorogovtsev, J.F.F. Mendes, Evolution of Networks: From Biological Nets to the Internet and WWW, Oxford University Press, Oxford, 2003.
[3] S. Boccaletti, et al., Phys. Rep. 424 (2006) 175.
[4] M. Girvan, M.E.J. Newman, Proc. Natl. Acad. Sci. USA 99 (2002) 7821.
[5] R. Guimerà, L.A.N. Amaral, Nature 433 (2005) 895.
[6] M.E.J. Newman, Proc. Natl. Acad. Sci. USA 103 (2006) 8577.
[7] M.E.J. Newman, Phys. Rev. E 74 (2006) 036104.
[8] L. Danon, A. Díaz-Guilera, J. Duch, A. Arenas, J. Stat. Mech. (2005) P09008.
[9] H.-J. Zhou, Phys. Rev. E 67 (2003) 041908.
[10] A. Arenas, A. Díaz-Guilera, C.J. Pérez-Vicente, Phys. Rev. Lett. 96 (2006) 114102.
[11] J.-S. Wu, Y. Jiao, Chaos 24 (2014) 033104.
[12] F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, D. Parisi, Proc. Natl. Acad. Sci. USA 101 (2004) 2658.
[13] Y.-Q. Hu, H.-B. Chen, P. Zhang, M.-H. Li, Z.-R. Di, Y. Fan, Phys. Rev. E 78 (2008) 026121.
[14] J.-S. Wu, X.-X. Li, L.-C. Jiao, X.-H. Wang, B. Sun, Physica A 392 (2013) 2265–2277.
[15] G. Palla, I. Derényi, I. Farkas, T. Vicsek, Nature 435 (2005) 814.
[16] V. Nicosia, G. Mangioni, V. Carchiolo, M. Malgeri, J. Stat. Mech. (2009) P03024.
[17] J.-S. Wu, L.-C. Jiao, C. Jin, F. Liu, M.-G. Gong, R.-H. Shang, W.-S. Chen, Phys. Rev. E 85 (2012) 016115.
[18] F.-H. Zhu, W.-X. Wang, Z.-R. Di, Y. Fan, PLoS One 9 (2014) e97021.
[19] J.-R. Xie, S. Kelley, B.K. Szymanski, ACM Trans. Comput. Surv. 45 (2013) 43.
[20] F. Wei, W.-N. Qian, C. Wang, et al., World Wide Web 12 (2009) 235–261.
[21] X.-H. Wang, L.-C. Jiao, J.-S. Wu, Physica A 388 (2009) 5045–5056.
[22] H.-W. Shen, X.-Q. Cheng, J.-F. Guo, J. Stat. Mech. (2009) P07042.
[23] S.-H. Zhang, R.-S. Wang, X.-S. Zhang, Physica A 374 (2007) 483.
[24] S.-H. Zhang, R.-S. Wang, X.-S. Zhang, Phys. Rev. E 76 (2007) 046103.
[25] W.-H. Zhan, Z.-Z. Zhang, J.-H. Guan, S.-G. Zhou, Phys. Rev. E 83 (2011) 066120.
[26] M.E.J. Newman, M. Girvan, Phys. Rev. E 69 (2004) 026113.
[27] J. Reichardt, S. Bornholdt, Phys. Rev. E 74 (2006) 016110.
[28] P. Ronhovde, Z. Nussinov, Phys. Rev. E 81 (2010) 046114.
[29] Z.-P. Li, S.-H. Zhang, R.-S. Wang, et al., Phys. Rev. E 77 (2008) 036109.
[30] A. Rodrigo, M. Ignacio, PLoS One 6 (2011) e24195.
[31] A. Rodrigo, M. Ignacio, Sci. Rep. 3 (2013) 1060.
[32] A. Arenas, J. Duch, A. Fernandez, S. Gomez, New J. Phys. 9 (2007) 176.
[33] E.A. Leicht, M.E.J. Newman, Phys. Rev. Lett. 100 (2008) 118703.
[34] R. Guimerà, M. Sales-Pardo, L.A.N. Amaral, Phys. Rev. E 76 (2007) 036102.
[35] M.J. Barber, Phys. Rev. E 76 (2007) 066102.
[36] M. Tasgin, H. Bingol, e-print arXiv:0711.0491.
[37] C. Pizzuti, Lecture Notes in Computer Science, vol. 5199, 2008, pp. 1081–1090.
[38] M.-G. Gong, B. Fu, L.-C. Jiao, H.-F. Du, Phys. Rev. E 84 (2011) 056101.
[39] Y. Park, M. Song, in: J.R. Koza (Ed.), Proceedings of the Third Annual Conference on Genetic Programming, Morgan Kaufmann Publisher, Los Altos, CA, 1998, pp. 568–575.
[40] C. Pizzuti, Proc. of The 11th Annual Conference on Genetic and Evolutionary Computation, ACM, 2009, pp. 859–866.
[41] K.Y. Szeto, J. Zhang, Proc. of the 5th International Conference on Large-Scale Scientific Computing, 2006, pp. 189–196.
[42] N.L. Law, K.Y. Szeto, Proc. of the 20th International Joint Conference on Artificial Intelligence, AAAI Press, 2008, pp. 2330–2334.
[43] W.W. Zachary, J. Anthropol. Res. (1977).
[44] J. Duch, A. Arenas, Phys. Rev. E 72 (2005) 027104.
[45] S. Gregory, New J. Phys. 12 (2010) 10.
[46] W. Chen, Z. Liu, X. Sun, Y. Wang, Data Min. Knowl. Discov. 21 (2010) 224.
[47] I. Psorakis, S. Roberts, M. Ebden, B. Sheldon, Phys. Rev. E 83 (2011) 6.
[48] A. Lancichinetti, F. Radicchi, J.J. Ramasco, S. Fortunato, PLoS One 6 (2011) 4.
[49] J.-R. Xie, B.K. Szymanski, X. Liu, Proc. of the 11th IEEE International Conference on Data Mining Workshops, ICDMW’11, 2011, pp. 344–349.
[50] C. Lee, F. Reid, A. Mcdaid, N. Hurley, Proc. of the 4th Workshop on Social Network Mining and Analysis, 2010, pp. 33–42.
[51] D. Lusseau, Proc. R. Soc. Lond. Ser. B 54 (2003) S186–S188.
[52] D. Lusseau, M.E.J. Newman, Proc. R. Soc. Lond. Ser. B 271 (2004) S477–S481.
[53] S. Fortunato, M. Barthelemy, Proc. Natl. Acad. Sci. USA 104 (2007) 36.
[54] A. Arenas, A. Fernández, S. Gómez, New J. Phys. 10 (2008) 053039.
[55] D.-L. Lai, H.-T. Lu, C. Nardini, Phys. Rev. E 81 (2010) 066118.
[56] R. Rotta, A. Noack, ACM J. Exp. Algorithmics 16 (2011) 2.3.
[57] U. Benlic, J.-K. Hao, IEEE Trans. Evol. Comput. 15 (2011) 624–642.
