Overlapping community detection via preferential learning model




Physica A 527 (2019) 121265

Contents lists available at ScienceDirect

Physica A journal homepage: www.elsevier.com/locate/physa


JinFang Sheng a, Kai Wang a, ZeJun Sun a,b, Bin Wang a,∗, FaizaRiaz Khawaja a, Ben Lu a, JunKai Zhang a

a School of Computer Science and Engineering, Central South University, Changsha, Hunan, China
b Department of Network Center, Pingdingshan University, Pingdingshan, Henan, China

Highlights

• Accuracy. Accurate overlapping community detection and an accurate number of communities.
• Robustness. Stable performance across a variety of networks, especially very sparse networks.
• Stability. Lower randomness than other algorithms based on label propagation.
• Scalability. Suitable for large-scale networks because of its lower time complexity.
• Free parameters. No parameter settings are required.

Article info

Article history: Received 23 December 2018; Received in revised form 19 April 2019; Available online 27 April 2019

Keywords: Overlapping community detection; Preferential learning; Label propagation; Information dynamic; Cluster

Abstract

Overlapping community detection has become one of the most important tasks in network analysis because it better reflects the structure of real networks. Research on overlapping community detection not only promotes the study of network functions but also brings insight into the network topology. In this paper, we draw inspiration from learning behaviors and information exchange in the real world and propose a dynamic relationship-based preferential learning model for dynamic systems. We apply this model to the label propagation algorithm and present an overlapping community detection algorithm based on Preferential Learning and the Label Propagation Algorithm, called PLPA. The algorithm regards the network as a dynamic system. Each node selects a learning target and updates its own labels according to its degree of preference for its neighbor nodes. As learning proceeds, the information in the system finally reaches a steady state. Nodes that have the same label are considered to belong to the same community, so the overlapping community structure of the network is separated. In the experiments, we verified the performance of our algorithm on real-world and synthetic networks. Results show that PLPA not only performs better than many state-of-the-art algorithms on most data sets, but is also more applicable to networks with ambiguous community structure, especially sparse networks. © 2019 Elsevier B.V. All rights reserved.

∗ Corresponding author. E-mail address: [email protected] (B. Wang). https://doi.org/10.1016/j.physa.2019.121265 0378-4371/© 2019 Elsevier B.V. All rights reserved.

1. Introduction

In recent years, research on complex networks has received widespread attention [1–3], from academia to commercial areas. Complex networks, represented by social networks, transportation networks, biological networks, Internet link networks, paper citation networks, and communication networks, have penetrated into all important aspects of our daily life. One of the hot topics in the study of complex networks is community detection [4–6]. A community can be defined as a group of members whose links to each other are closer than their links to members outside the group. However, in real life, some people belong to several communities at the same time. For example, someone may be a basketball lover as well as a member of a research team. Therefore, overlapping community detection is more in line with the actual situation. Accurate and rapid detection of overlapping communities in large complex networks has very important theoretical and practical value. For example, it helps biologists understand the meaning of protein networks, helps the police discover potential criminal gangs, and helps provide more accurate marketing strategies.

Overlapping community discovery has been the subject of long-lasting and extensive discussion among complex network researchers. A large number of algorithms have tried to solve this problem from different perspectives [7–9], such as graph theory, modularity optimization, game theory, and dynamic methods. Each of these algorithms has its own advantages and disadvantages. Among them, the algorithms based on dynamic methods [10,11] are widely used because of their simplicity, high accuracy and practical significance. The basic idea is to regard the network as a dynamic system in which the nodes interact through defined interaction rules. The community structure is naturally divided after the dynamic system reaches a stable state. In a sense, COPRA [12] is a classic graph clustering approach based on the idea of dynamic methods. However, the COPRA algorithm has some limitations, such as monster communities and unstable results.
In this article, we propose an overlapping community detection algorithm, called PLPA, which is based on the idea of dynamic methods and discovers overlapping community structures in networks. This algorithm treats the network as a dynamic information system. Nodes in the network communicate with their neighbors through a given policy, and the overlapping community structure is clustered naturally after the system stabilizes.

1.1. Basic idea

In real life, what kind of people do you prefer to acquire knowledge from? In our view, people mainly gain knowledge from the following three types of people:

(1) People with more social influence and social resources [13]: A phenomenon widely studied in economics and education, called the peer effect [14], tells us that companies and individuals are more likely to learn from partners who are the leading voices of their field. In an academic network, the views of scholars with greater influence and higher academic recognition are more easily disseminated within the academic circle, and we are more inclined to accept their opinions.
(2) People who are closer to you: For example, we communicate with colleagues in the same department more than with colleagues in other departments, in both the amount of information exchanged and the frequency of communication. In daily life, we therefore communicate most with the people who interact with us most often. When making decisions, the opinions of friends and relatives often play an important role. This also shows that the more intimate the relationship, the greater the impact.
(3) People who have a similar professional background or similar knowledge: In social networks, we are more likely to communicate with people who share our interests and professional background, because it is easier to accept the opinions of such people.
However, in a dynamic system, the information possessed by a node and its relationships with other nodes change constantly as the interaction proceeds. Biology offers many examples of such positive feedback systems. A common example is a contaminated lake: the number of fish falls because the water quality deteriorates, the decay of dead fish worsens the pollution of the lake, and this in turn causes more fish deaths. The relationship between a single part of a dynamic system and its surroundings is thus itself changing dynamically. We regard the topological network as a dynamic system in which nodes communicate with each other through learning behaviors. In real-life information exchange, people's learning strategies and learning behaviors are easily affected by their surroundings. It is not difficult to understand that we tend to exchange information with people who are more influential, who have a more intimate relationship with us, and whose knowledge structure is more similar to ours. From a more realistic viewpoint, the relationship between people and their surroundings is typically changing [15]. Sun et al. [16] introduce a preference learning mechanism based on dynamic relations into the field of game theory. In this mechanism, relationships affect an individual's learning strategy and thus the individual's behavior [17], and the individual's learning behavior in return affects the relationships. The relationships are therefore also changing dynamically. A diagram of the preferential learning mechanism based on dynamic relations is shown in Fig. 1.
Based on the dynamic relationship-based preferential learning mechanism described above, the clustering process of the information dynamic system consists of the following steps. The first step is system initialization: each node receives a unique ID, which represents the category of knowledge that the node initially owns. The next step is information exchange: the knowledge category of more influential nodes in the dynamic system spreads farther. When a node learns information from its neighbors, it is more likely to accept information from nodes that are more closely connected to it and more similar to it in knowledge category. After the information exchange process, the knowledge categories owned by each node tend to be stable. The nodes that hold the same information category after the iteration are naturally clustered into the same community. Fig. 2 shows the workflow of the PLPA algorithm on a simple network.

Fig. 1. Preferential learning mechanism based on dynamic relationship.

Fig. 2. The workflow of PLPA. The color of a node identifies the information carried by the node, and the thicker the edge between two nodes, the stronger the preference between them. (a) In the initialization phase, the nodes carry different information, and the preference strength between nodes is the same. (b) During information interaction, information is constantly exchanged between the nodes, and nodes that belong to the same community become more intimate with each other. (c) When the system reaches a stable stage, the preference between nodes within a community is much greater than between nodes of different communities, so learning behavior within a community is much more frequent than between communities. This coincides with the definition of a community. Finally, nodes with the same information are considered to be in the same community, and nodes belonging to multiple communities are considered to be overlapping nodes.

1.2. Contributions

Previously proposed overlapping community detection algorithms based on label propagation select labels randomly during propagation, which ignores the network topology and the similarity between nodes. Unlike these algorithms, the proposed algorithm introduces a preference selection mechanism: each node chooses a label based on its relationships, and these relationships change constantly as the interaction progresses. Because of this preference selection mechanism and its full consideration of the network topology (including the characteristics of every node and the relationships between nodes) and of the similarity of the labels held by nodes, PLPA performs well in overlapping community detection. The most prominent are the following five points:

(1) Accurate and effective overlapping community detection. The PLPA algorithm can find high-quality community structures in both real-world and synthetic networks (cf. Figs. 4a, 5a, 6a), and the number of communities found is closer to the actual situation (cf. Figs. 4b, 5b, 6b).
(2) Robustness. Experiments show that the PLPA algorithm performs very stably in networks with different characteristics (cf. Figs. 4a, 5a, 6a), even under harsh network characteristics where other algorithms struggle, especially in very sparse networks.
(3) Stability. Compared to other algorithms based on label propagation, its randomness is greatly reduced (cf. Figs. 4c, 5c, 6c).
(4) Scalability. PLPA can be applied to large-scale networks because of its lower time complexity.


(5) Free parameters. Instead of depending on prior knowledge or parameter settings, the PLPA algorithm does not require any parameter settings. The overlapping community structure is separated naturally and automatically by the local topology of the network.

The remainder of the paper is organized as follows. In the next section, we briefly introduce related work on overlapping community detection algorithms. In Section 3, we focus on the details of the preference learning model based on dynamic relationships and the PLPA algorithm. In Section 4, we compare our algorithm with several other representative algorithms on real-world and synthetic networks. Finally, the conclusions are presented in Section 5.

2. Related work

To detect overlapping community structures in complex networks, a large number of excellent algorithms have emerged over the past decades. The earliest research in this field is the Clique Percolation Method (CPM), proposed in 2005 [18]. CPM is based on the concept of complete subgraphs. The algorithm first finds the complete subgraphs of size k in the network, where k is a preset parameter, and then builds overlapping communities by searching for neighboring subgraphs. Regrettably, CPM is only suitable for networks with densely connected parts. Other researchers have also proposed algorithms based on the idea of maximal cliques [19,20]. In another line of work, LFM [21] maximizes a local fitness function by expanding a community from a random seed. The resolution can be tuned by a parameter, enabling different hierarchical levels of organization to be investigated. Regrettably, the performance of LFM depends heavily on this parameter, which controls the size of the communities. Moreover, the results of LFM are not stable because of the randomness of seed selection. There are also algorithms called Dynamics-Based methods, which are widely used because of their high accuracy [22].
Dynamics-Based methods usually define a dynamic process or iteration rules to detect the community structure. COPRA [12], developed by Gregory in 2010, extends the label propagation algorithm (LPA) [23]. The core idea of this algorithm is to allow each node to carry multiple labels, which means that each node can belong to multiple communities at the same time. In each label updating round, the order of node updates is set randomly, and nodes randomly learn and normalize the labels of their neighbor nodes during label propagation. The disadvantage of this algorithm is that it tends to produce many small communities. In addition, SLPA [24] defines dynamic iteration rules, called the speak rule and the listen rule, by which nodes exchange labels. There are also other algorithms based on the dynamic system concept, such as particle walks [25], synchronization [26,27], and distance dynamics [10,28].

Another line of research in overlapping community detection is based on game theory [29–31]. For example, the approach proposed by L. Zhou [32] identifies overlapping and hierarchical communities. This approach models community detection as a coalition formation game, in which individuals in a social network are modeled as rational players who aim to improve their group's utility by cooperating with other players to form coalitions. Each player is allowed to join multiple groups, and groups with fewer players can merge into a larger coalition as long as the merge improves the merged group's utility, so overlapping and hierarchical communities can be revealed simultaneously. Game theory helps to identify communities because it provides a formal analysis framework and mathematical tools for studying the complex interactions between rational participants. Unfortunately, none of those algorithms performs well in terms of stability.
3. Overlapping community detection via dynamic relationship-based preferential learning mechanism

The overlapping community detection algorithm that we propose, based on the preferential learning mechanism, builds on the idea of label propagation. The most important representative of this idea, LPA [23], is an efficient, near-linear algorithm for non-overlapping community detection. However, its high randomness is a drawback that cannot be ignored. LPA is divided into three steps (a structure that this paper also follows): the initialization phase, the label propagation phase, and the community division phase. The basic flow of LPA is as follows:

(1) Label initialization phase: Each node is assigned a unique label, usually its node ID, and each node can hold only one label.
(2) Label propagation phase: Visiting the nodes in random order, each node updates its label to the label held by the largest number of its neighbors. If several labels are tied for most frequent, one of them is selected at random. The algorithm converges when no node changes its label any more, at which point disjoint communities are separated.
(3) Community division phase: When the iteration terminates, nodes with the same label are considered to belong to the same community.

Our plan is to add concepts such as preference degree and label similarity to the idea of label propagation, so that label updates in the propagation phase become more purposeful, improving accuracy and stability.
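The three LPA phases listed above can be illustrated with a minimal Python sketch. It is not the reference implementation of [23]: the adjacency-dictionary representation, the function name `label_propagation`, the `max_iter` safety cap and the `seed` argument are all assumptions made for illustration.

```python
import random
from collections import Counter

def label_propagation(adj, max_iter=100, seed=0):
    """Basic LPA sketch; adj maps each node to the set of its neighbors."""
    rng = random.Random(seed)            # seed is an illustrative assumption
    labels = {v: v for v in adj}         # phase 1: one unique label per node
    nodes = list(adj)
    for _ in range(max_iter):            # max_iter is a safety cap (assumption)
        rng.shuffle(nodes)               # phase 2: random update order
        changed = False
        for v in nodes:
            if not adj[v]:
                continue                 # isolated nodes keep their label
            counts = Counter(labels[u] for u in adj[v])
            best = max(counts.values())
            candidates = [l for l, c in counts.items() if c == best]
            if labels[v] in candidates:
                continue                 # already holds a most frequent label
            labels[v] = rng.choice(candidates)   # random tie-break
            changed = True
        if not changed:                  # converged: no label changed
            break
    communities = {}                     # phase 3: group nodes by label
    for v, l in labels.items():
        communities.setdefault(l, set()).add(v)
    return list(communities.values())

# Two disconnected triangles (made-up example data).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
print(label_propagation(adj))
```

Keeping a node's current label whenever it is already among the most frequent ones is one common way to make the convergence test well defined; other tie-breaking conventions exist.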


Table 1
A summary of the notations.

Symbol              Definition
n                   The number of nodes (n = |V|)
m                   The number of edges (m = |E|)
b(c, u)             The belonging coefficient of node u which belongs to community c
J(x, y)             Jaccard similarity coefficient of node x and node y
N(x)                The number of neighbors of node x
C(x)                The local clustering coefficient of node x
CS(x, y)            The contact strength of node x and node y
LS(x, y)            The label similarity of node x and node y
Preference(x, y)    Node x's preference for node y
P(x, y_i)           The probability that node x selects node y_i as a learning target

3.1. Principle definition and concepts

Before introducing our proposed algorithm, we first give some basic concepts that will be used in this article. These concepts are often used in the field of complex networks. All notations are briefly listed in Table 1. Let G = (V, E) be a given network, where V is the set of nodes and E is the set of edges.

Definition 1 (Belonging Coefficient). This concept is borrowed from COPRA [12]. Each node u has a series of belonging coefficients b(c, u), each of which represents the strength with which node u belongs to community c. The sum of all the belonging coefficients of node u is 1.

Definition 2 (Jaccard Similarity Coefficient). In the given undirected network G = (V, E), the Jaccard similarity coefficient of node x and node y is defined as (1):

$$J(x, y) = \frac{|\tau(x) \cap \tau(y)|}{|\tau(x) \cup \tau(y)|} \tag{1}$$

where τ(x) represents the set consisting of node x and its neighbor nodes, which can be defined as (2):

$$\tau(x) = \{\, y \in V \mid (x, y) \in E \,\} \cup \{x\} \tag{2}$$
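As a small illustration of Eqs. (1) and (2), the following Python sketch computes the Jaccard coefficient on closed neighborhoods. The adjacency-dictionary representation and the function names are assumptions chosen for this example, not part of the paper.

```python
def closed_neighborhood(adj, x):
    """tau(x) of Eq. (2): node x together with its neighbors."""
    return set(adj[x]) | {x}

def jaccard(adj, x, y):
    """J(x, y) of Eq. (1), computed on the closed neighborhoods."""
    tx, ty = closed_neighborhood(adj, x), closed_neighborhood(adj, y)
    return len(tx & ty) / len(tx | ty)

# Made-up example: a triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(jaccard(adj, 0, 1))  # tau(0) = tau(1) = {0, 1, 2}, so J = 1.0
print(jaccard(adj, 2, 3))  # |{2, 3}| / |{0, 1, 2, 3}| = 0.5
```

Because τ(x) always contains x itself, the union is never empty and two adjacent nodes always have a non-zero coefficient.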

Definition 3 (Local Clustering Coefficient). In the given undirected network G = (V, E), the local clustering coefficient of node x is defined as (3):

$$C(x) = \frac{2n}{|N(x)| \, (|N(x)| - 1)} \tag{3}$$
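A minimal Python sketch of Eq. (3) follows. Returning 0.0 for nodes with fewer than two neighbors is an assumption of this sketch (the formula is undefined there), as are the function name and the adjacency-dictionary representation.

```python
from itertools import combinations

def local_clustering(adj, x):
    """C(x) of Eq. (3): 2 * (edges among neighbors) / (k * (k - 1))."""
    nbrs = adj[x]
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumption: Eq. (3) is undefined for k < 2, use 0
    # Count the edges that connect two neighbors of x.
    links = sum(1 for u, w in combinations(nbrs, 2) if w in adj[u])
    return 2 * links / (k * (k - 1))

# Made-up example: a triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(local_clustering(adj, 0))  # neighbors 1 and 2 are linked: 1.0
print(local_clustering(adj, 2))  # 1 of 3 possible links: 0.333...
```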

In Eq. (3), |N(x)| represents the number of neighbors of node x, and n represents the number of edges between the neighbors of node x (not the number of nodes listed in Table 1).

Definition 4 (Contact Strength). Relationships between people in the real world usually include strong ties and weak ties, which are of great significance for information exchange and community formation. We introduce the concept of contact strength (also called connection strength below) to characterize the closeness between nodes. In the given undirected network G = (V, E), the contact strength of node x and node y is defined as (4):

$$CS(x, y) = \frac{|N(x) \cap N(y)|}{T_x} \tag{4}$$
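Eq. (4) can be sketched in Python as below, reading N(x) as the set of neighbors of x and T_x as the number of triangles through x. Returning 0.0 when x lies in no triangle is an assumption of this sketch, since the quotient is otherwise undefined; the function names are also illustrative.

```python
from itertools import combinations

def triangles_at(adj, x):
    """T_x: number of triangles containing x (edges among x's neighbors)."""
    return sum(1 for u, w in combinations(adj[x], 2) if w in adj[u])

def contact_strength(adj, x, y):
    """CS(x, y) of Eq. (4): common neighbors of x and y divided by T_x."""
    t_x = triangles_at(adj, x)
    if t_x == 0:
        return 0.0  # assumption: no triangle through x means zero strength
    return len(adj[x] & adj[y]) / t_x

# Made-up example: two triangles 0-1-2 and 0-2-3 sharing the edge 0-2.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
print(contact_strength(adj, 0, 2))  # 2 shared triangles / T_0 = 2 -> 1.0
print(contact_strength(adj, 0, 1))  # 1 shared triangle  / T_0 = 2 -> 0.5
print(contact_strength(adj, 1, 0))  # 1 shared triangle  / T_1 = 1 -> 1.0
```

The last two lines show that the measure is not symmetric: the denominator depends on the first argument only.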

In Eq. (4), T_x represents the number of triangles that contain node x, and |N(x) ∩ N(y)|, the number of common neighbors of x and y, represents the number of triangles shared by node x and node y.

3.2. Preferential learning model based on dynamic relationship

Based on the basic concepts introduced above, we propose a preferential learning model based on dynamic relationships. It involves several concepts: information learning range, label similarity, preference degree, propagation probability, and information loss.

In real life, we generally prefer to contact our neighbors rather than others. Based on this idea, each node in the model learns only from its neighboring nodes. Because of this local iterative approach, the model is also applicable to overlapping community discovery in large-scale networks. In addition, a person is more willing to communicate with people who have a similar professional background or the same interests, and finds it easier to understand the opinions they express. How do we describe the similarity of two people in professional background or hobbies? We propose a label similarity formula between two nodes to solve this problem. Furthermore, in the real world, we learn with different strengths from the knowledge of the different people around us. We define the preference degree, based on degree, clustering coefficient, connection strength, and label similarity, to characterize this strength. During each iteration, a node selects its learning target from its neighbors based on the strength of the preference. The stronger the preference for a certain neighbor node, the greater the probability of learning from this node. We propose a propagation probability model to characterize this probability. Of course, not all knowledge learned from neighbors is preserved, just as we forget less important or less interesting information in real life. After all the information, both learned and original, is integrated, we delete the labels whose belonging coefficients are below a certain threshold and then normalize all the knowledge categories. We therefore formulate the preferential learning model in terms of label similarity, preference degree, propagation probability, and information loss.

(1) Label similarity. As mentioned above, each node in a given graph has a series of labels, each of which corresponds to one belonging coefficient. Let b(c, u) represent the belonging coefficient of node u to label c. We define the label similarity LS(x, y) between node x and node y as (5):

$$LS(x, y) = \sum_{\text{all } C_x = C_y} b(C_x, x) \cdot b(C_y, y) \tag{5}$$
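Eq. (5) amounts to a sum of products over the labels that two nodes share. A minimal Python sketch, assuming each node's labels are stored as a dictionary from label to belonging coefficient (a representation chosen for this example):

```python
def label_similarity(bx, by):
    """LS(x, y) of Eq. (5): sum of b(c, x) * b(c, y) over shared labels c."""
    return sum(coef * by[c] for c, coef in bx.items() if c in by)

# Made-up belonging coefficients; each dictionary sums to 1.
bx = {"a": 0.5, "b": 0.5}
by = {"b": 0.25, "c": 0.75}
print(label_similarity(bx, by))                  # only "b" is shared: 0.125
print(label_similarity({"a": 1.0}, {"a": 1.0}))  # identical single label: 1.0
```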

In Eq. (5), C_x represents a label held by node x, the sum runs over the labels shared by x and y, and the range of LS(x, y) is [0, 1]. We can regard labels as the knowledge or interest categories owned by each person in a social network. The larger the value of LS(x, y), the more similar the educational background and interests of node x and node y. Many communities in real life are formed by people with the same major or common interests. The label similarity is therefore an important reference when choosing a learning target. As the iteration progresses, the set of labels held by each node and the belonging coefficient of each label change. Correspondingly, the label similarity between nodes also changes with the iterations.

(2) Preference degree. In the dynamic relationship-based preferential learning model that we propose, the preference degree is introduced to avoid the randomness of previous algorithms based on LPA [23]. It gives each node a basis for selecting its learning target. According to the preference learning phenomenon in real social networks described above, people usually choose to communicate with three types of people: those who have more social resources or higher status in their field, those who are in closer contact with us, and those whose knowledge background is more similar to ours. We propose a preference formula for overlapping community detection that quantifies the degree of preference of a node for each of its neighbors. We define the degree of preference of node x for node y as Preference(x, y) (6):

$$\mathrm{Preference}(x, y) = \begin{cases} |N(y)| \cdot e^{\,c(y) + LS(x, y) + CS(x, y)} \cdot (J(x, y) - \varepsilon), & J(x, y) > \varepsilon \\ 0, & J(x, y) \le \varepsilon \end{cases} \tag{6}$$
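The piecewise definition in Eq. (6) can be sketched as a small Python function. To keep the example self-contained, it takes the already computed quantities as arguments; the function name, the argument names and the choice to pass precomputed values are assumptions of this sketch, while the default ε = 0.05 is the value quoted in the text.

```python
import math

def preference(deg_y, c_y, ls_xy, cs_xy, j_xy, eps=0.05):
    """Preference(x, y) of Eq. (6).

    deg_y: |N(y)|, c_y: c(y), ls_xy: LS(x, y), cs_xy: CS(x, y), j_xy: J(x, y).
    """
    if j_xy <= eps:
        return 0.0  # neighbors with very low Jaccard similarity are ignored
    return deg_y * math.exp(c_y + ls_xy + cs_xy) * (j_xy - eps)

# Made-up input values for illustration only.
print(preference(deg_y=3, c_y=0.0, ls_xy=0.0, cs_xy=0.0, j_xy=0.55))
print(preference(deg_y=3, c_y=1.0, ls_xy=0.5, cs_xy=0.5, j_xy=0.55))
print(preference(deg_y=3, c_y=1.0, ls_xy=0.5, cs_xy=0.5, j_xy=0.05))  # 0.0
```

The exponential makes the three terms in the exponent act multiplicatively, so a neighbor that scores well on clustering, label similarity and contact strength at once is favored strongly.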

In Eq. (6), |N(y)| denotes the number of neighbors of node y, c(y) denotes the local clustering coefficient of node y (written C(y) in Eq. (3)), J(x, y) and CS(x, y) denote the Jaccard similarity and the connection strength between node x and node y respectively, and LS(x, y) denotes the label similarity of node x and node y. The constant ε is usually set to 0.05; its purpose is to eliminate learning between two nodes with a very low Jaccard similarity coefficient. We also tried other values of ε but did not achieve better results. With this formula we try to map the rules of information exchange in real life onto overlapping community detection in complex networks. We regard the degree of a node as the volume of resources of the node. The local clustering coefficient indicates the probability that the node becomes the core of a community. Further, the Jaccard similarity coefficient and the connection strength indicate how closely two nodes are connected, and the label similarity reveals how similar the two nodes are in the categories they belong to.

(3) Propagation probability. According to the preference degree formula proposed above, node x has a particular preference degree for each of its neighboring nodes. Let N(x) denote the set of neighboring nodes of node x, and let y_i belong to N(x). When node x needs to select a node from N(x) to exchange information with, the higher the degree of preference of node x for y_i, the greater the probability of selecting y_i. In other words, each neighbor has a probability of being selected, and a node with a low degree of preference can still be selected. This is like real life, where we may still learn from people who are less important to us or who contact us less often. We define the probability P(x, y_i) that node x selects node y_i as its learning target as (7):

$$P(x, y_i) = \frac{\mathrm{Preference}(x, y_i)}{\sum_{y_j \in N(x)} \mathrm{Preference}(x, y_j)} \tag{7}$$
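Eq. (7) is a normalization of the preference degrees followed by a weighted random draw. The Python sketch below assumes the preferences of node x are given as a dictionary from neighbor to Preference(x, y_i); the function names and the handling of the all-zero case (returning no target) are assumptions, since the text does not say what happens when every preference is zero.

```python
import random

def selection_probabilities(prefs):
    """P(x, y_i) of Eq. (7) for every neighbor y_i in prefs."""
    total = sum(prefs.values())
    if total == 0:
        return {}  # assumption: no eligible learning target
    return {y: p / total for y, p in prefs.items()}

def pick_target(prefs, rng=random):
    """Draw one learning target with probability proportional to preference."""
    probs = selection_probabilities(prefs)
    if not probs:
        return None
    targets = list(probs)
    return rng.choices(targets, weights=[probs[t] for t in targets], k=1)[0]

# Made-up preference degrees of one node for its neighbors.
prefs = {"y1": 1.0, "y2": 3.0}
print(selection_probabilities(prefs))  # {'y1': 0.25, 'y2': 0.75}
print(pick_target(prefs, random.Random(0)))
```

Sampling, rather than always taking the neighbor with the highest preference, is what lets low-preference neighbors still be chosen occasionally, as the text describes.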

(4) Information loss. In real life, information exchange and information storage between people are accompanied by information loss. After a round of learning, a node integrates and normalizes its knowledge, which includes both what it learned from neighbors and what it originally held. After this operation, the labels whose belonging coefficients are below the threshold are deleted. A large number of experiments [12,24] have shown that the choice of threshold is especially important. COPRA [12] handles a similar scenario by introducing a parameter v: after an iteration completes, every pair whose belonging coefficient is less than 1/v is deleted, where v represents the maximum number of communities to which any vertex can belong. There are reasons for this approach, but it also has deficiencies:

(1) The parameter v is introduced into the algorithm. The value of v has a considerable impact on the resulting community division, and how to choose v to obtain the best results is still a difficult problem.
(2) The parameter v imposes the same maximum number of communities on every node, whether its degree is large or small. This is inconsistent with reality. It is hard to imagine a node of degree 1 belonging to five communities.

In this paper, 1/|N(x)| is used as the threshold for deleting labels, where |N(x)| is the number of neighbors of node x. This means that node x can belong to at most |N(x)| communities at the same time. This choice is more realistic and introduces no extra parameter.

3.3. Convergence of preferential learning model based on dynamic relationship

In the iterative process, the number of labels in the network decreases with information loss and is eventually reduced to a minimum. So when does the algorithm stop? Previous algorithms typically choose one of two ways. The first, adopted by SLPA [24], is to set a maximum number of iterations ITERMAX and stop when the number of iterations reaches ITERMAX. The other, used by LPA [23], is to stop when the label list of each node in the current iteration it is exactly the same as in the previous iteration it − 1. However, neither method can be used as a termination condition for our algorithm because of their disadvantages.
A fixed ITERMAX cannot suit networks of different sizes. The disadvantage of the latter is that the number of labels may be reduced again after several further information exchanges. We chose the method borrowed from COPRA [12], which computes the minimum number of nodes labeled with each community identifier. This termination condition is triggered within a limited number of iterations, which ensures that the algorithm stops.

3.4. PLPA algorithm

In this section, we further elaborate on the preference learning algorithm PLPA, which is based on the preferential learning model described above.

(1) Information initialization. There is no learning behavior at the beginning, and each node undergoes an initialization process. First, a globally unique label is assigned to each node, and the node's belonging coefficient to this label is set to 1. Then the degree and the local clustering coefficient (cf. Eq. (3)) of each node are calculated, together with the Jaccard similarity coefficient and the connection strength between the node and each of its neighbors (cf. Eqs. (1) and (4)).
(2) Information dynamic interaction. In the label propagation phase, during each iteration a node first calculates its label similarity with its neighbor nodes (cf. Eq. (5)) and its preference degree for them (cf. Eq. (6)). It then selects a learning target according to the preference learning probability (cf. Eq. (7)) and learns from it. Finally, it accounts for the loss of information and normalizes its label list. As the iteration progresses, all the nodes in the network reach a stable state and all the labels of the nodes converge.
(3) Find overlapping communities. When the nodes in the network reach a steady state, each node has one or more labels. A node with multiple labels is considered an overlapping node. The community structure is detected naturally by placing the nodes with the same label into the same community.

Fig. 3(a)–(c) shows the three steps of the preference learning algorithm on a simple kite network. In the initial phase (Fig. 3(a)), each node is assigned a label and the necessary information is calculated. Then each node interacts with its neighboring nodes according to the preference learning model. Fig. 3(b) shows the label of each node after the first iteration. As time evolves, all the nodes in the network reach a stable state. Fig. 3(c) shows two communities (community 3 and community 7), which are naturally detected by placing the nodes that have the same label into one community. The PLPA algorithm is given in full as Algorithm 1.
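The learning and information-loss part of step (2) can be sketched in Python as follows. The text does not spell out how learned and original labels are combined, so summing the two coefficient dictionaries before normalizing is an assumption of this sketch, as is the fallback of keeping the strongest label when every coefficient falls below 1/|N(x)|; the function and argument names are illustrative, and `degree` is assumed to be at least 1.

```python
def learn_and_prune(own, learned, degree):
    """One label update for a node with `degree` neighbors.

    own, learned: dictionaries mapping label -> belonging coefficient.
    """
    # Integrate original and learned labels (assumption: coefficients add up).
    merged = dict(own)
    for c, coef in learned.items():
        merged[c] = merged.get(c, 0.0) + coef
    total = sum(merged.values())
    merged = {c: b / total for c, b in merged.items()}   # normalize

    # Information loss: delete labels below the threshold 1 / |N(x)|.
    threshold = 1.0 / degree
    kept = {c: b for c, b in merged.items() if b >= threshold}
    if not kept:
        # Assumption: never leave a node unlabeled, keep its strongest label.
        best = max(merged, key=merged.get)
        kept = {best: merged[best]}

    total = sum(kept.values())
    return {c: b / total for c, b in kept.items()}        # renormalize

# Made-up example: a node with 3 neighbors learns from a two-label target.
print(learn_and_prune({"a": 1.0}, {"b": 0.75, "c": 0.25}, degree=3))
```

In this example the merged coefficients are 0.5, 0.375 and 0.125; label "c" falls below 1/3 and is deleted, and the remaining two are rescaled to sum to 1.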

3.5. Complexity analysis

In the PLPA algorithm, the time complexity is mainly composed of three parts, which we discuss in turn.

(1) Information initialization. There is only one loop over the nodes in this phase, so assigning the labels takes O(n), where n is the number of nodes. The PLPA algorithm also needs to calculate necessary information such as the node degree, the Jaccard similarity coefficient and so on, which takes O(k ∗ n), where k is the average degree of the network. Thus, the time complexity of this phase is O(n + k ∗ n).


J. Sheng, K. Wang, Z. Sun et al. / Physica A 527 (2019) 121265

Fig. 3. Illustration of the preference learning algorithm. (a) The initial phase of the kite network. (b) The label of each node after the first iteration. (c) The final state of the kite network.

Algorithm 1 PLPA
Input: G = (V, E)
Output: community set Setc = {C1, C2, C3, ..., Ck}, where k is the number of communities

// Information initialization.
for each node v in V do
    set label(v) = v;
    set belonging coefficient b(v, v) = 1.0;
    calculate the local clustering coefficient using Eq. (3);
    for each node u in N(v) do
        calculate the Jaccard similarity coefficient using Eq. (2);
        calculate the connection strength using Eq. (4);
    end for
end for

// Information dynamic interaction.
Flag = TRUE
while Flag do
    shuffleOrder(V);
    for each node v in V do
        for each node u in N(v) do
            calculate the label similarity (cf. Eq. (5)) between v and u;
            calculate the preference degree (cf. Eq. (6)) of v for u;
        end for
        select the learning target of v from N(v) according to Eq. (7);
        learn and normalize labels;
        if the belonging coefficient of a label of v < 1/|N(v)| then
            delete that label of v;
        end if
    end for
    if the iteration termination condition is reached then
        Flag = FALSE;
    end if
end while

// Find overlapping communities.
for each node v in V do
    for each label l in the label list of node v do
        add v to Cl;
    end for
end for

// Return the resulting communities C.
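The last two stages of Algorithm 1 (pruning weak labels, normalizing, and grouping nodes by label) can be sketched as below. This is a hedged illustration under stated assumptions: the function names are hypothetical, the threshold is passed in as a parameter rather than derived from the neighborhood size, and keeping the strongest label when every label falls below the threshold is an assumption borrowed from common label propagation practice, not something the text specifies.

```python
from collections import defaultdict


def prune_and_normalize(coeffs, threshold):
    """coeffs maps label -> belonging coefficient for one node.
    Labels below the threshold are dropped and the rest rescaled to sum to 1."""
    kept = {label: b for label, b in coeffs.items() if b >= threshold}
    if not kept:
        # Assumption: never leave a node unlabeled; keep its strongest label.
        best = max(coeffs, key=coeffs.get)
        kept = {best: coeffs[best]}
    total = sum(kept.values())
    return {label: b / total for label, b in kept.items()}


def extract_communities(labels):
    """labels maps node -> {label: coefficient}. Nodes sharing a label form a
    community; nodes holding several labels are overlapping nodes."""
    communities = defaultdict(set)
    for node, coeffs in labels.items():
        for label in coeffs:
            communities[label].add(node)
    overlapping = {node for node, coeffs in labels.items() if len(coeffs) > 1}
    return dict(communities), overlapping
```

A node that retains two labels after pruning appears in two communities, which is how the overlap arises without any extra bookkeeping.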

(2) Information dynamic interaction. Since each node selects only one node as its learning target, the time complexity of computing the preference degree between a node and its neighbors is O(k ∗ |L|), where |L| is the length of the label list held by the node; this length is at most k. Moreover, the time complexity of selecting the learning target is O(k). After learning, the time complexity of deleting labels and normalizing is O(|L|). The maximum number of iterations is denoted as MaxIter, which is typically between 50 and 200. Thus, the time complexity of the entire stage is exactly O(MaxIter ∗ n ∗ (k ∗ |L| + k + |L|)), which, because |L| ≤ k, is no greater than O(MaxIter ∗ n ∗ (k^2 + 2k)).


(3) Find overlapping communities. The time complexity of this phase is O(|Setc| ∗ n), where |Setc| is the number of communities. Usually |Setc| ≪ n.

In summary, the time complexity of PLPA is the sum of the three parts, namely O(n + k ∗ n + MaxIter ∗ n ∗ (k ∗ |L| + k + |L|) + |Setc| ∗ n). Normally n < k ∗ n < MaxIter ∗ k ∗ n. Thus, the time complexity is less than O(MaxIter ∗ n ∗ k^2). We note that the values of MaxIter and k are usually small. For sparse networks, PLPA can even be regarded as almost linear.

4. Experiment

In this section, we demonstrate the performance of our algorithm through a series of experiments on overlapping community detection. To show the contribution of the proposed algorithm, we choose six well-known overlapping community detection algorithms for comparison. How to judge the performance of each method is, however, a key issue. Depending on the experimental data set, we select suitable evaluation metrics, and the merits and demerits of each algorithm are measured from various angles and analyzed accordingly. The experimental process is described in three parts. First, we introduce the comparing algorithms used in the experiments. Then we apply the seven algorithms (our algorithm and the comparing algorithms) to some real-world networks and observe their performance. Finally, we design and generate multiple synthetic networks to examine the algorithms from different perspectives, and analyze the experimental results to illustrate where each algorithm is applicable. The experimental environment is a PC with an Intel i5-4590 3.2 GHz CPU, 8 GB RAM and the Windows 8 operating system. To eliminate the effect of randomness, all experiments were performed under the same conditions, repeated 100 times, and averaged.

4.1. Comparing algorithm

Clique Percolation Method (CPM) [33] is a well-known overlapping community detection algorithm based on cliques. A clique is a set of vertices in which any two vertices are connected, also called a complete subgraph. The algorithm assumes that the internal edges of a community are likely to form large complete subgraphs, while the edges between communities are very unlikely to do so. This fully complies with the definition of a community, so communities can be discovered by finding the cliques in the network. In general, the complexity of CPM is O(3.14^(n/3)), where n is the number of nodes.

LINK [34] reveals the community structure by partitioning the links of a network. The algorithm allows communities to overlap on nodes and implements overlapping community discovery by partitioning the nodes of the line graph of the original network. In general, the complexity of LINK is O(n × k_max^2), where k_max is the maximum node degree.

COPRA [12] is an overlapping community detection algorithm based on an extended LPA, which adds a list of labels to each node. The length of the list is a parameter v of the algorithm, so each node can have at most v labels. However, the algorithm still has some disadvantages such as strong randomness, poor robustness, and the monster community, in which all nodes are assigned to the same community. Its complexity is O(v m log(v m/n)), where v is the parameter of the algorithm, and m and n are the numbers of edges and nodes of the network.

Speaker–listener Label Propagation Algorithm (SLPA) [24] is an overlapping community detection algorithm in which nodes exchange labels according to specific interaction rules. Specifically, it treats a node as the recipient of information and its neighbors as the senders. Each sender follows the speaker rule to provide the recipient with a label. The recipient accepts a label according to the pre-defined listener rule and stores it in the memory of the node. After a certain number of iterations, the labels in each node's memory whose share of that memory is below a certain threshold are deleted, and nodes with the same label are considered to belong to the same community. Nodes with multiple labels are called overlapping nodes. Notably, the complexity of SLPA is O(tm), where m is the number of edges and t is the maximum number of iterations, usually between 20 and 100.

Demon [35] is a simple local-first discovery method for overlapping communities. It lets each node vote for the communities it sees around it in its limited view of the global system, and finally the local communities are merged into a global collection. Demon's complexity is O(n × k_max^(3−α)), where α is the parameter of the algorithm. Overall, its complexity is sub-quadratic.

Nectar [36] is an overlapping community discovery algorithm extended from the Louvain method (LM) [37] and based on its local search heuristic. Its most notable feature is that it is the first community detection approach that selects the objective function [36,38] dynamically according to the characteristics of the network at hand. The process of Nectar is divided into an inner iteration and an outer iteration. In the inner iteration, the objective function is determined according to the number of triangles. In general, the complexity of Nectar is greater than O(n^2).

For a summary of the basic ideas and time complexity of our method and the comparing algorithms, refer to Table 2.
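The speaker and listener rules described for SLPA can be illustrated with a short sketch. It is a simplified reading of the description above, not the reference implementation of [24]: the function name `slpa`, the default parameter values, and the specific rules (each speaker sends a label drawn uniformly from its memory, the listener stores the most frequent label it received) are assumptions made for illustration. Node identifiers are assumed to be mutually comparable so that the run is reproducible for a given seed.

```python
import random
from collections import Counter


def slpa(adj, iterations=20, threshold=0.1, seed=0):
    """Simplified speaker-listener label propagation on an undirected graph.
    adj maps node -> set of neighbours. Returns {label: set of nodes}."""
    rng = random.Random(seed)
    memory = {v: [v] for v in adj}  # every node remembers its own label first
    nodes = list(adj)
    for _ in range(iterations):
        rng.shuffle(nodes)
        for listener in nodes:
            if not adj[listener]:
                continue
            # Speaker rule (assumed): send a label drawn from the speaker's
            # memory, so frequent labels are sent more often.
            received = [rng.choice(memory[s]) for s in sorted(adj[listener])]
            counts = Counter(received)
            top = max(counts.values())
            # Listener rule (assumed): keep the most popular received label,
            # breaking ties at random.
            winners = sorted(l for l, c in counts.items() if c == top)
            memory[listener].append(rng.choice(winners))
    # Post-processing: drop labels whose share of the memory is too small.
    communities = {}
    for v, mem in memory.items():
        for label, c in Counter(mem).items():
            if c / len(mem) >= threshold:
                communities.setdefault(label, set()).add(v)
    return communities
```

Each iteration touches every edge a constant number of times, which is consistent with the O(tm) complexity quoted above.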

Table 2. Algorithms included in the experiments.

| Algorithm | Basic ideas | Complexity |
|---|---|---|
| PLPA | Preference learning | less than O(MaxIter ∗ n ∗ k^2) |
| CPM | Clique percolation | O(3.14^(n/3)) |
| LINK | Link partition | O(n × k_max^2) |
| COPRA | Label propagation | O(v m log(v m/n)) |
| SLPA | Label propagation | O(tm) |
| Demon | Democratic voting | O(n × k_max^(3−α)) |
| Nectar | Maximize function | more than O(n^2) |

Table 3. Statistical properties of several real-world networks: node number |V|, edge number |E|, average degree k.

| Dataset | \|V\| | \|E\| | k |
|---|---|---|---|
| Karate [39] | 34 | 78 | 4.588 |
| Dolphin [40] | 62 | 159 | 5.129 |
| Football [41] | 115 | 613 | 10.661 |
| Jazz [42] | 198 | 2,742 | 27.697 |
| Email [43] | 1133 | 5,451 | 9.622 |
| Protein [44] | 2239 | 6,452 | 5.763 |
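The average degree column of Table 3 follows from k = 2|E|/|V| for an undirected network. The short check below recomputes it; the helper name `average_degree` is illustrative, the Jazz edge count of 2742 is taken from the dataset description in Section 4.2.1, and the Football edge count of 613 is the one listed in Table 3.

```python
def average_degree(n_nodes, n_edges):
    """Average degree of an undirected network: every edge adds 2 to the degree sum."""
    return 2 * n_edges / n_nodes


# (|V|, |E|) pairs for the real-world networks of Table 3.
networks = {
    "Karate": (34, 78),
    "Dolphin": (62, 159),
    "Football": (115, 613),
    "Jazz": (198, 2742),
    "Email": (1133, 5451),
    "Protein": (2239, 6452),
}

for name, (n_nodes, n_edges) in networks.items():
    print(f"{name}: k = {average_degree(n_nodes, n_edges):.3f}")
```

Rounded to three decimals, the recomputed values are 4.588, 5.129, 10.661, 27.697, 9.622 and 5.763.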

4.2. Real-world networks

4.2.1. Data description

To compare the performance of our algorithm with the above-mentioned comparing algorithms on real-world networks, we have selected some data sets that are often used for overlapping community discovery problems. All the data sets are available from the Stanford dataset collection (http://snap.stanford.edu/data/), the dataset collection of Mark Newman (http://www-personal.umich.edu/~mejn/netdata/), (www.cs.bris.ac.uk/steve/networks/peacockpaper/) and the KONECT network data repository (http://konect.uni-koblenz.de/networks/). Table 3 gives a brief overview of these real-world data sets.

Zachary's karate club (karate): This network is a classic data set in the field of social network analysis. In the early 1970s, sociologist Zachary spent two years observing the social relations among 34 members of a karate club in a U.S. university. Based on the associations of these members within the club, a social network between members was constructed. An edge between two nodes means that the corresponding members are at least friends with frequent contacts. Later, the club split into two clubs because of disagreements about its future direction.

Dolphin social network (dolphin): This is the social network that D. Lusseau et al. built by observing the interactions of 62 dolphins in Doubtful Sound, New Zealand, over 7 years. The network has 62 nodes and 159 edges. Nodes represent dolphins, and edges indicate frequent contact between dolphins.

American College football (football): This is a complex social network built from the American College Football League. The network contains 115 nodes and 613 edges. Each node represents a football team, and an edge between two nodes means that the two teams have played a match against each other. 
The 115 teams are divided into 12 leagues, and matches between teams within the same league are far more frequent than matches between teams in different leagues. Thus, the leagues can be regarded as the real community structure of the network.

Jazz musicians (jazz): This is a collaboration network of jazz artists. The nodes represent musicians, and the links between nodes represent the cooperative relationships between musicians. The network contains 198 nodes and 2742 edges.

Email (email): This network comes from a research group within the University of Rovira i Virgili (Tarragona) and is used to analyze individual social relationships within the group. It consists of 1133 nodes and 5451 edges.

Human protein-Figeys (protein): This network, with 2239 nodes and 6452 edges, reflects the interactions between human proteins. It stems from the first use in this field of a mass spectrometry-based approach to reveal the interactions between proteins in humans.

4.2.2. Evaluation metrics

Since the data sets adopted in this experiment are real-world data sets whose community structure is not clearly known, the extended modularity EQ [45] is used as the evaluation metric. The range of the EQ value is [0,1]. The larger the value of EQ, the better the detected overlapping communities. EQ is defined as (8):

$$
EQ = \frac{1}{2m} \sum_{i \in C} \sum_{j \in C} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \frac{1}{Q_i Q_j} \tag{8}
$$


Table 4. The EQ performance of several algorithms on real-world networks.

| Algorithm | Karate | Dolphin | Football | Jazz | Email | Protein |
|---|---|---|---|---|---|---|
| PLPA | 0.262 | 0.3 | 0.44 | 0.224 | 0.29 | 0.246 |
| CPM | 0.132 | 0.205 | 0.151 | 0.158 | 0.065 | 0.141 |
| LINK | 0.148 | 0.135 | 0.16 | 0.046 | 0.095 | 0.161 |
| COPRA | 0.224 | 0.187 | 0.398 | 0.202 | 0.048 | 0.221 |
| SLPA | 0.188 | 0.092 | 0.289 | 0.124 | 0.075 | 0.182 |
| Demon | 0.04 | 0.336 | 0.32 | 0.09 | 0.061 | 0.071 |
| Nectar | 0.262 | 0.199 | 0.2785 | 0.209 | 0.308 | 0.223 |
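The EQ scores in Table 4 are computed from Eq. (8). A minimal sketch of that computation is given below, assuming that the double sum in Eq. (8) is evaluated inside each detected community and then added up over all communities; the function name `extended_modularity` and the input layout are hypothetical. Here m is the number of edges, k_i the degree of node i, A_ij the adjacency entry, and Q_i the number of communities containing node i.

```python
def extended_modularity(adj, communities):
    """EQ for an undirected, unweighted graph.
    adj maps node -> set of neighbours; communities is a list of node sets
    that may overlap. Every node is assumed to be in at least one community."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2  # number of edges
    degree = {v: len(nbrs) for v, nbrs in adj.items()}

    # Q_i: number of communities node i belongs to.
    membership = {v: 0 for v in adj}
    for community in communities:
        for v in community:
            membership[v] += 1

    total = 0.0
    for community in communities:
        for i in community:
            for j in community:
                a_ij = 1.0 if j in adj[i] else 0.0
                expected = degree[i] * degree[j] / (2 * m)
                total += (a_ij - expected) / (membership[i] * membership[j])
    return total / (2 * m)
```

When no node belongs to more than one community, every Q_i equals 1 and the expression reduces to the ordinary modularity.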

Table 5. Some common parameters in the LFR model.

| Properties | Definition |
|---|---|
| N | The number of nodes |
| k | Average degree |
| maxk | Maximum degree |
| mu | Mixing parameter |
| minc | Minimum for the community sizes |
| maxc | Maximum for the community sizes |
| on | Number of overlapping nodes |
| om | Number of memberships of the overlapping nodes |
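The parameters of Table 5 can be collected in a small configuration object. The sketch below is purely illustrative: the class `LFRParams` is hypothetical and is not the interface of any LFR generator. It encodes the settings stated for the experiments in Section 4.3.1, namely maxk = 2.5*k and maxc = 5*minc, and builds the second group of networks, in which only mu varies from 0.1 to 0.8.

```python
from dataclasses import dataclass


@dataclass
class LFRParams:
    """Hypothetical container for the LFR properties listed in Table 5."""
    N: int      # number of nodes
    k: float    # average degree
    mu: float   # mixing parameter
    minc: int   # minimum community size
    on: int     # number of overlapping nodes
    om: int     # memberships of each overlapping node

    @property
    def maxk(self):
        # Experimental default: maximum degree is 2.5 times the average degree.
        return 2.5 * self.k

    @property
    def maxc(self):
        # Experimental default: maximum community size is 5 times the minimum.
        return 5 * self.minc


# Second group: fixed N, k, minc, on, om; mu swept from 0.1 to 0.8.
second_group = [
    LFRParams(N=2000, k=16, mu=round(0.1 * i, 1), minc=10, on=300, om=3)
    for i in range(1, 9)
]
```

Deriving maxk and maxc instead of storing them keeps the two default ratios in one place when k or minc is swept.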

In Eq. (8), k_i represents the degree of node i, m represents the number of edges of the network, A is the adjacency matrix of the network, and Q_i is the number of communities to which node i belongs.

4.2.3. Performance evaluation

To further compare the algorithms, we investigate their performance on the real-world networks. For every comparing algorithm that has parameters, the parameters used in the experiments follow its authors' suggestions. Table 4 shows the performance of the proposed algorithm and the comparing algorithms on the real-world networks. We can see that our algorithm achieves either the best or the second best performance on all real-world networks. The proposed algorithm is only slightly worse than Demon on the dolphin network and than Nectar on the Email network, while it performs significantly better than the comparing algorithms on the other data sets. Demon's performance is not very stable; in particular, it struggles on the karate network. As an improved form of COPRA, the proposed algorithm follows the same general trend as COPRA, but because it takes learning preferences into account, its performance is significantly better. But what are the effects of network characteristics, such as degree, on algorithm performance? Next we further analyze the algorithms on synthetic networks.

4.3. Synthetic networks

Although the real-world networks can reflect the performance of an algorithm to some extent, they cannot reveal its characteristics well because of some restrictions. In this section we generate synthetic networks to compare the above algorithms in more detail and analyze them accordingly.

4.3.1. Data description

To better approximate the characteristics of real-world networks, we have chosen the LFR [46] model to generate the synthetic networks. This model, widely used in the field of complex networks, makes it easy to control the characteristics of the generated network, such as the average degree and the maximum community size. Table 5 shows the commonly used LFR properties for controlling the characteristics of the synthetic network. The parameter mu deserves special attention: it is defined as the fraction of a node's edges that connect to nodes outside its community. Briefly, the higher mu is, the vaguer the community structure is.

We used the LFR model to generate several groups of network data sets. Unless stated otherwise, we keep the following settings throughout the whole experiment: the maximum degree maxk is 2.5*k and the maximum community size maxc is 5*minc. In the first group, we generated six data sets with different numbers of nodes. The characteristics of these data sets are shown in Table 6. In addition, to verify the impact of the synthetic network features on the algorithms, including the mixing parameter mu, the average degree k and the number of memberships of the overlapping nodes om, we generated three additional groups of data. In the second group, we fix the number of nodes N = 2000, the average degree k = 16, the minimum community size minc = 10, the number of overlapping nodes on = 300 and the number of memberships of the overlapping nodes om = 3, and then vary the mixing parameter mu between 0.1 and 0.8 to generate several networks. Moreover, we generated a third group of networks to observe the performance of the algorithm under


Table 6. Statistical properties of the data sets in the first group: number of nodes N, edge number |E|, average degree k, mixing parameter mu, number of memberships of the overlapping nodes om, number of overlapping nodes on, minimum for the community sizes minc.

| Dataset | N | \|E\| | k | mu | om | on | minc |
|---|---|---|---|---|---|---|---|
| LFR1 | 500 | 2,572 | 10.288 | 0.1 | 3 | 100 | 10 |
| LFR2 | 2,000 | 15,509 | 15.509 | 0.3 | 3 | 300 | 10 |
| LFR3 | 5,000 | 18,721 | 7.488 | 0.1 | 4 | 1000 | 10 |
| LFR4 | 7,000 | 47,330 | 13.523 | 0.3 | 4 | 1200 | 10 |
| LFR5 | 10,000 | 56,780 | 11.356 | 0.1 | 5 | 2000 | 10 |
| LFR6 | 20,000 | 154,000 | 15.405 | 0.3 | 5 | 4000 | 10 |

Table 7. Statistical properties of the data sets in the other three groups: number of nodes N, average degree k, mixing parameter mu, number of memberships of the overlapping nodes om, number of overlapping nodes on, minimum for the community sizes minc.

| Dataset | N | k | mu | om | on | minc |
|---|---|---|---|---|---|---|
| 2nd | 2000 | 16 | 0.1–0.8 | 3 | 300 | 10 |
| 3rd | 1000 | 5–20 | 0.1 | 2 | 100 | 10 |
| 4th | 5000 | 12 | 0.3 | 2–10 | 1000 | 10 |

different network tightness. In the third group of synthetic networks, we let N = 1000, mu = 0.1, om = 2, on = 100 and minc = 10, and then change the average degree k from 5 to 20 to generate several synthetic networks. Finally, we also need to observe the effect of different om on the performance of the algorithms, so we use the LFR model to generate the fourth group of synthetic networks. In this group, we fix N = 5000, k = 12, minc = 10 and mu = 0.3, and then change the number of memberships of the overlapping nodes om from 2 to 10. The details of each data set are shown in Table 7.

4.3.2. Evaluation metrics

In the past decade or so, many evaluation indicators [47–49] have been used to measure the performance of community discovery, but they are not suitable for evaluating overlapping community discovery algorithms such as the ones above. We use the Extended Normalized Mutual Information (ENMI) [50], proposed by Lancichinetti and widely used in overlapping community discovery, to evaluate our algorithm and the comparing algorithms. The ENMI value lies in [0,1] and indicates the degree of similarity between the detected communities and the real communities. The closer the result is to 1, the better the division. ENMI is defined as (9):

$$
ENMI = \frac{-2 \sum_{i \in C', j \in C''} N_{ij} \log\left( \frac{N_{ij} N}{N_i N_j} \right)}{\sum_{i \in C'} N_i \log\left( \frac{N_i}{N} \right) + \sum_{j \in C''} N_j \log\left( \frac{N_j}{N} \right)} \tag{9}
$$
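A direct transcription of Eq. (9) into code may help to make the index conventions concrete. The sketch below is an illustration only: the function name `enmi_eq9` is hypothetical, terms with an empty intersection are skipped (their contribution tends to zero), and returning 1.0 when the denominator vanishes, which happens when every community in both covers contains all nodes, is a convention assumed here. N is the number of nodes, N_i and N_j are community sizes in the real and detected structures, and N_ij is the size of their intersection.

```python
import math


def enmi_eq9(true_cover, found_cover, n_nodes):
    """Evaluate Eq. (9) for two lists of non-empty node sets."""
    numerator = 0.0
    for c_true in true_cover:
        for c_found in found_cover:
            n_ij = len(c_true & c_found)
            if n_ij == 0:
                continue  # an empty overlap contributes nothing
            numerator += n_ij * math.log(
                n_ij * n_nodes / (len(c_true) * len(c_found))
            )
    denominator = sum(len(c) * math.log(len(c) / n_nodes) for c in true_cover)
    denominator += sum(len(c) * math.log(len(c) / n_nodes) for c in found_cover)
    if denominator == 0:
        # Assumed convention: both structures are the single trivial community.
        return 1.0
    return -2.0 * numerator / denominator
```

Two identical partitions give a value of 1, while two partitions that are statistically independent of each other give 0.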

Here, C′ represents the real community structure, C′′ represents the community structure obtained by the algorithm, N is the number of nodes in the network, N_i represents the number of nodes in the ith community in C′, and N_j represents the number of nodes in the jth community in C′′. Moreover, N_ij represents the number of nodes shared by the ith community in C′ and the jth community in C′′.

The ratio of the number of communities detected by the algorithm to the number of real communities in the ground truth of the benchmark also reflects the accuracy of the algorithm to a certain extent. Therefore, we introduce another evaluation metric, ratio [24], which is defined as the ratio between the number of communities detected by the algorithm and the number of real communities in the network. As mentioned above, the reported ratio is the average over 100 experiments. In addition, the results of SLPA, COPRA and PLPA inevitably contain some randomness because they all extend the LPA algorithm. We adopt the standard deviation (std) [51] of ratio to characterize the stability of these three algorithms in the community discovery process. The standard deviation is defined as (10):

  N 1 ∑ (ratioi − ratio)2 std = √ N

(10)

i=1
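The ratio metric and its standard deviation in Eq. (10) amount to a few lines of arithmetic. The sketch below assumes a hypothetical helper `ratio_stats` that receives the number of communities found in each repeated run together with the ground-truth number; it uses the population form of the standard deviation (division by N), as in Eq. (10).

```python
import math


def ratio_stats(detected_counts, true_count):
    """Return (mean ratio, std of ratio) over repeated runs.
    detected_counts: number of communities found in each run.
    true_count: number of communities in the ground truth."""
    ratios = [d / true_count for d in detected_counts]
    mean_ratio = sum(ratios) / len(ratios)
    variance = sum((r - mean_ratio) ** 2 for r in ratios) / len(ratios)
    return mean_ratio, math.sqrt(variance)


# Example: three runs on a benchmark with 10 real communities.
mean_ratio, std = ratio_stats([10, 12, 8], 10)
```

A mean ratio near 1 with a std near 0 indicates that an algorithm finds about the right number of communities and does so consistently across runs.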

In Eq. (10), N represents the number of experiments (100 in this experiment), ratio_i represents the ratio between the number of communities found in the ith experiment and the actual number of communities, and the mean ratio subtracted inside the sum is the average of the N experimental ratios.

4.3.3. Performance evaluation

Initially, the first group of synthetic networks was used to test the performance of the seven algorithms. We use ENMI as the main evaluation indicator. Table 8 shows the ENMI performance of each algorithm on each data set. It can be seen in Table 8 that our algorithm has the best (LFR1, 2, 3, 4) or second best (LFR6) performance on five of the six data sets. The only data set on which it did not rank in the top two was LFR5. By analyzing Table 8, we can easily see that the


Table 8. The ENMI performance of several algorithms on the first group of LFR synthetic networks.

| Algorithm | LFR1 | LFR2 | LFR3 | LFR4 | LFR5 | LFR6 |
|---|---|---|---|---|---|---|
| PLPA | 0.7075 | 0.7759 | 0.6197 | 0.6738 | 0.5095 | 0.602 |
| CPM | 0.56 | 0.75 | 0.2497 | 0.6 | 0.572 | 0.62 |
| LINK | 0.32 | 0.349 | 0.391 | 0.269 | 0.362 | 0.265 |
| COPRA | 0.599 | 0.608 | 0.498 | 0.421 | 0.493 | 0.494 |
| SLPA | 0.69 | 0.74 | 0.47 | 0.612 | 0.55 | 0.506 |
| Demon | 0.4203 | 0.607 | 0.4622 | 0.5635 | 0.484 | 0.499 |
| Nectar | 0.6377 | 0.76 | 0.45 | 0.66 | 0.61 | 0.6 |

Fig. 4. The performance of several algorithms on LFR synthetic networks with the mixing parameter mu varying from 0.1 to 0.8.

CPM is not stable: it performs well on some data sets (LFR5, 6) but poorly on others. LINK struggles on most data sets. However, as we have just analyzed, the proposed algorithm performs very well on most data sets, especially on LFR3, which has the lowest average degree of all data sets. While the ENMI values obtained by the other algorithms on LFR3 are below 0.5, the proposed algorithm still obtains a score above 0.6. We will verify the effect of the average degree on the algorithms in the third group of experiments.

To observe the effects of changes in different network characteristics on the above algorithms, we generated three additional groups of data sets in which mu, k and om vary, respectively. We apply each algorithm to the second group, keeping the other properties the same and changing the mixing parameter mu from 0.1 to 0.8. The results are shown in Fig. 4a. We can observe that the performance of all algorithms decreases with increasing mu, which is consistent with the practical meaning of mu. PLPA ranks in the top two on almost all data sets and performs best on most of them (mu = 0.2–0.5). It is worth noting that the COPRA algorithm maintains a stable performance while mu < 0.6, but when mu > 0.6 its performance declines sharply. It can be seen that COPRA is almost incapable of handling networks with ambiguous overlapping community structure. The LINK algorithm has almost no advantage over the other algorithms. And the PLPA algorithm is less sensitive to the increase of mu than other


Fig. 5. The performance of several algorithms on LFR synthetic networks with the average degree k varying from 5 to 25.

algorithms. The experimental results show that although NECTAR does not perform so well at low mu, it surprisingly maintains good division results when mu is high, where the community structure is very blurred.

In addition, the ratio of the number of detected communities to the number of communities in the ground truth, that is, the ratio mentioned above, is also one of the indicators of the merits of an algorithm. We computed the ratio obtained by each algorithm on the second group of data sets, as shown in Fig. 4b. The LINK algorithm is not displayed because the number of communities it finds is much larger than the number of real communities; its ratio sometimes exceeds one hundred, which makes a comparison with the other algorithms meaningless. We also want to know whether our algorithm is more stable than the other LPA-based algorithms (COPRA, SLPA) in terms of the number of communities discovered. The standard deviation of the ratio is shown in Fig. 4c.

We can get some information from Fig. 4b. When mu < 0.5, the number of communities found by the vast majority of algorithms is very close to the real situation, and the number of communities discovered by PLPA and SLPA is almost equal to it. However, as mu increases beyond 0.6, the COPRA algorithm assigns all nodes in the network to one community, that is, the monster community phenomenon occurs. A similar situation also appears with NECTAR. The CPM and SLPA algorithms also tend to assign more nodes to one community as mu increases, but to a far lesser degree than COPRA. Only the PLPA and Demon algorithms obtain good results when mu > 0.6, but Demon does not perform as well as PLPA when mu varies from 0.2 to 0.6.

Fig. 4c shows the standard deviations of the three algorithms based on label propagation. The COPRA algorithm has a large standard deviation at mu = 0.5 because it sometimes divides the community structure normally and sometimes, owing to its randomness, produces only a monster community that contains all the nodes in the network. When mu > 0.6, std = 0 because COPRA can only produce the monster community. Although the instability of PLPA also increases as mu increases, overall it not only overcomes the disadvantage of COPRA, which can only produce the monster community in a network with a blurred community structure, but also improves the stability of the division results. Based on the above facts, PLPA has the following advantages: relative insensitivity to mu, high accuracy, high stability, and results that are closer to the actual situation.


Fig. 6. The performance of several algorithms on LFR synthetic networks with the number of memberships of the overlapping nodes om varying from 2 to 10.

We apply the algorithms to the third group of data sets, in which the other characteristics are kept unchanged and the average degree k varies from 5 to 20. The ENMI performance of each algorithm is shown in Fig. 5a. We recorded the ratio values of the algorithms in this experiment and plotted them in Fig. 5b. We also paid special attention to the standard deviations of the ratios of COPRA, SLPA and PLPA, which are shown in Fig. 5c.

We can learn from Fig. 5a that the performance of all algorithms is positively correlated with the average degree k. The performance of each algorithm improves as k increases until it reaches a stable value. Specifically, all the algorithms except LINK have similar and outstanding performance in sufficiently dense networks, where the ENMI scores are all above 0.9, which means that the results are basically consistent with the actual situation. However, it is worth noticing that in sparse networks (such as k = 5, 8 in this group of data sets), PLPA still obtains good results where the other algorithms do not perform well. This shows that PLPA is more suitable for sparse networks than the comparing algorithms.

We can draw similar conclusions from Figs. 5b and 5c. All algorithms have a ratio almost equal to 1 in dense networks, which means that the number of communities they find is almost exactly equal to the real number. The three LPA-based algorithms that we are particularly concerned with are also highly stable in dense networks. However, in sparse networks (k = 5, 8), the situation is very different. COPRA, SLPA and NECTAR tend to produce more non-meaningful communities, and the performance of CPM is very uncertain. Only PLPA and Demon are basically unaffected, and the number of communities they discover is very close to the actual situation. Of the two, PLPA is the better one. From the results of Fig. 5c, it can be seen that the performance of PLPA is very stable in both dense and sparse networks, and is almost unaffected by changes in the average degree k.

Finally, we examine how the number of memberships of the overlapping nodes om, which can also be defined as the maximum number of communities that a node belongs to, affects the performance of each algorithm. As before, the ENMI value obtained by each algorithm with om varying from 2 to 10 is shown in Fig. 6a, and the ratio of the number of communities found by each algorithm to the ground truth and the stability of the three LPA-based algorithms are shown in Figs. 6b and 6c.


Fig. 7. The runtime of several algorithms on LFR synthetic networks with the number of nodes varying from 1000 to 50M.

From the above experimental results, we can see that both ENMI and ratio are negatively related to om, that is to say, both indices decrease as om increases. The reason is that the greater the value of om, the larger the number of communities to which some nodes belong at the same time, so more and more communities overlap. In this case, the algorithm treats nodes that actually belong to different communities as one community during the division, which explains why the value of ratio decreases as om increases. Moreover, the larger om is, the more the division result deviates from the actual situation, which is why the algorithms struggle in the high om case. Even so, we can see from Fig. 6a that our algorithm and the Demon algorithm are almost always the two best-performing algorithms on the entire group of data sets. The proposed algorithm is the best on four data sets (om = 3, 4, 5, 6), while the Demon algorithm is the best on four other data sets (om = 7, 8, 9, 10). The CPM algorithm has no advantage when om is low, but when om is high it has roughly the same results as the other algorithms. The LINK algorithm, however, always lags behind the other algorithms.

The results in Fig. 6b show that our algorithm is always the closest to the actual situation in terms of the number of communities discovered, because its ratio is always closest to 1 whether om is high or low. It is worth noting that the ratio obtained by NECTAR does not seem to change as om increases. CPM produces many non-meaningful communities when om is relatively low (om = 2, 3), which confirms our analysis above that CPM has a lower ENMI when om is low. The COPRA algorithm shows large fluctuations: for om < 8 the ratio is greater than 1.2, and for om > 8 the ratio is less than 0.9. The results in Fig. 6c show that for om > 8 the community detection results of the COPRA algorithm are very unstable. In fact, the stability of COPRA is not as good as that of SLPA and PLPA on the entire data set.

4.4. Runtime

To evaluate the running time of the above algorithms, we use the LFR model to generate a group of data sets. In this group, we set mu = 0.1, om = 2, k = 10 and on/k = 0.1, with the number of nodes varying from 1000 to 50M. The experimental results are shown in Fig. 7. The runtimes of CPM, LINK and Nectar on 50M nodes are not shown because they exceed 24 h in our experimental environment. PLPA is more efficient than CPM, LINK, Nectar and Demon because its complexity is less than O(MaxIter ∗ n ∗ k^2). This advantage grows as the number of nodes increases. Therefore, PLPA is suitable for detecting overlapping communities in large-scale networks. PLPA is indeed slower than SLPA and COPRA, but the proposed algorithm not only effectively improves the accuracy of overlapping community detection (cf. Figs. 4a, 5a, 6a) but also effectively reduces randomness (cf. Figs. 4c, 5c, 6c).

5. Conclusions

In this paper, we propose a preference learning mechanism based on learning behavior and information interaction in social networks, and present a highly accurate, widely applicable and parameter-free overlapping community detection algorithm, PLPA. We then verify the performance of PLPA on real-world and synthetic networks. Experiments show that, compared with many state-of-the-art algorithms, PLPA not only performs well on most data sets but is also more applicable to networks with ambiguous community structure, especially sparse networks. Furthermore, PLPA greatly reduces randomness compared with the algorithms of the same family. Finally, our future research directions include a parallel implementation and overlapping community detection in dynamic networks.
Acknowledgments

The research was supported by the National Science and Technology Major Project of China (2017ZX06002005).

References

[1] Z. Sun, B. Wang, J. Sheng, Y. Hu, Y. Wang, J. Shao, Identifying influential nodes in complex networks based on weighted formal concept analysis, IEEE Access 5 (99) (2017) 3777–3789.
[2] J. Shao, Q. Yang, H.V. Dang, B. Schmidt, S. Kramer, Scalable clustering by iterative partitioning and point attractor representation, ACM Trans. Knowl. Discov. Data 11 (1) (2016) 5.
[3] H.J. Li, Z. Bu, Z. Wang, J. Cao, Y. Shi, Enhance the performance of network computation by a tunable weighting strategy, IEEE Trans. Emerg. Top. Comput. Intell. 2 (3) (2018) 214–223.
[4] Z. Bu, H.J. Li, J. Cao, Z. Wang, G. Gao, Dynamic cluster formation game for attributed graph clustering, IEEE Trans. Cybern. PP (99) (2017) 1–14.
[5] S. Fortunato, D. Hric, Community detection in networks: A user guide, Phys. Rep. 659 (2016) 1–44.
[6] H.J. Li, J.J. Daniels, Social significance of community structure: statistical view, Phys. Rev. E 91 (1) (2015) 012801.
[7] J. Xie, S. Kelley, B.K. Szymanski, Overlapping community detection in networks: the state-of-the-art and comparative study, ACM Comput. Surv. 45 (4) (2011) 1–35.
[8] L. Hui-Jia, B. Zhan, L. Yulong, Z. Zhongyuan, C. Yanchang, L. Guijun, C. Jie, Evolving the attribute flow for dynamical clustering in signed networks, Chaos Solitons Fractals 110 (110) (2018) 20–27.
[9] W. Chen, Z. Liu, X. Sun, Y. Wang, A game-theoretic framework to identify overlapping communities in social networks, Data Min. Knowl. Discov. 21 (2) (2010) 224–240.
[10] L. Chen, J. Zhang, L.J. Cai, Overlapping community detection based on link graph using distance dynamics, Internat. J. Modern Phys. B 32 (3) (2018) 1850015.
[11] T. Wu, Y. Guo, L.T. Chen, Y.B. Liu, Fast overlapping and hierarchical community detection via local dynamic interaction, Eprint Arxiv 448 (2014) 68–80.
[12] S. Gregory, Finding overlapping communities in networks by label propagation, New J. Phys. 12 (10) (2009) 2011–2024.
[13] Z.X. Wu, X.J. Xu, Z.G. Huang, S.J. Wang, Y.H. Wang, Evolutionary prisoner’s dilemma game with dynamic preferential selection, Phys. Rev. E 74 (1) (2006) 021107.
[14] G.C. Winston, D.J. Zimmerman, Peer effects in higher education. Discussion paper, Acad. Achievement 39 (2) (2003) 43.
[15] G. Palla, A.-L. Barabási, T. Vicsek, Quantifying social group evolution, Nature 446 (7136) (2007) 664.
[16] J. Sun, R. Fan, M. Luo, Y. Zhang, L. Dong, The evolution of cooperation in spatial prisoner’s dilemma game with dynamic relationship-based preferential learning, Physica A 512 (2018) 598–611.
[17] Q. Pan, X. Liu, H. Bao, Y. Su, M. He, Evolution of cooperation through adaptive interaction in a spatial prisoner’s dilemma game, Physica A 5492 (2018) 571–581.
[18] G. Palla, I. Derényi, I. Farkas, T. Vicsek, Uncovering the overlapping community structure of complex networks in nature and society, Nature 435 (7043) (2005) 814.
[19] J. Li, X. Wang, Y. Cui, Uncovering the overlapping community structure of complex networks by maximal cliques, Physica A 415 (2014) 398–406.
[20] J. Cheng, W. Xiao, M. Zhou, S. Gao, L. Cong, A novel method for detecting new overlapping community in complex evolving networks, IEEE Trans. Syst. Man Cybern. Syst. PP (99) (2018) 1–13.
[21] A. Lancichinetti, S. Fortunato, J. Kertész, Detecting the overlapping and hierarchical community structure of complex networks, New J. Phys. 11 (3) (2009) 19–44.
[22] H.J. Li, Z. Bu, A. Li, Z. Liu, S. Yong, Fast and accurate mining the community structure: Integrating center locating and membership optimization, IEEE Trans. Knowl. Data Eng. 28 (9) (2016) 2349–2362.
[23] U.N. Raghavan, R. Albert, S. Kumara, Near linear time algorithm to detect community structures in large-scale networks, Phys. Rev. E 76 (2) (2007) 036106.
[24] J. Xie, B.K. Szymanski, X. Liu, SLPA: Uncovering overlapping communities in social networks via a speaker-listener interaction dynamic process, in: IEEE International Conference on Data Mining Workshops, 2011, pp. 344–349.
[25] S. Lim, S. Ryu, S. Kwon, K. Jung, LinkSCAN*: Overlapping community detection using the link-space transformation, in: IEEE International Conference on Data Engineering, 2014, pp. 292–303.
[26] G. Cai, S. Jiang, S. Cai, L. Tian, Cluster synchronization of overlapping uncertain complex networks with time-varying impulse disturbances, Nonlinear Dynam. 80 (1–2) (2015) 503–513.
[27] S. Jiang, G. Cai, S. Cai, L. Tian, X. Lu, Adaptive cluster general projective synchronization of complex dynamic networks in finite time, Commun. Nonlinear Sci. Numer. Simul. 28 (13) (2015) 194–200.
[28] J. Shao, Z. Han, Q. Yang, T. Zhou, Community detection based on distance dynamics, in: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015, pp. 1075–1084.
[29] X. Zhao, Y. Wu, C. Yan, Y. Huang, An algorithm based on game theory for detecting overlapping communities in social networks, in: International Conference on Advanced Cloud and Big Data, 2017, pp. 150–157.
[30] X. Li, M. Jusup, Z. Wang, H. Li, L. Shi, B. Podobnik, H.E. Stanley, S. Havlin, S. Boccaletti, Punishment diminishes the benefits of network reciprocity in social dilemma experiments, Proc. Natl. Acad. Sci. USA 115 (1) (2018) 30–35.
[31] Z. Xu, X. Zhao, Y. Liu, S. Geng, A game theoretic algorithm to detect overlapping community structure in networks, Phys. Lett. A 382 (13) (2018) 872–879.
[32] L. Zhou, P. Yang, K.L.L. Wang, H. Chen, A Fast Approach for Detecting Overlapping Communities in Social Networks Based on Game Theory, Springer International Publishing, 2015.
[33] G. Palla, I. Derényi, I. Farkas, T. Vicsek, Uncovering the overlapping community structure of complex networks in nature and society, Nature 435 (2005) 814–818.
[34] Y.-Y. Ahn, J.P. Bagrow, S. Lehmann, Link communities reveal multiscale complexity in networks, Nature 466 (7307) (2010) 761.
[35] M. Coscia, G. Rossetti, F. Giannotti, D. Pedreschi, Demon: a local-first discovery method for overlapping communities, in: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012, pp. 615–623.
[36] Y. Cohen, D. Hendler, A. Rubin, Node-centric detection of overlapping communities in social networks, in: IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 2016, pp. 1–10.
[37] V.D. Blondel, J.L. Guillaume, R. Lambiotte, E. Lefebvre, Fast unfolding of communities in large networks, J. Stat. Mech. 2008 (10) (2008) 155–168.
[38] M. Chen, K. Kuzmin, B.K. Szymanski, Extension of modularity density for overlapping community structure, in: IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 2014, pp. 856–863.
[39] W.W. Zachary, An information flow model for conflict and fission in small groups, J. Anthropol. Res. 33 (4) (1977) 452–473.
[40] D. Lusseau, K. Schneider, O.J. Boisseau, P. Haase, E. Slooten, S.M. Dawson, The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations, Behav. Ecol. Sociobiol. 54 (4) (2003) 396–405.
[41] M. Girvan, M.E.J. Newman, Community structure in social and biological networks, Proc. Natl. Acad. Sci. USA 99 (12) (2002) 7821–7826.
[42] P.M. Gleiser, L. Danon, Community structure in jazz, Adv. Complex Syst. 6 (04) (2003) 565–573.

[43] R. Guimerà, L. Danon, A. Díaz-Guilera, F. Giralt, A. Arenas, Self-similar community structure in a network of human interactions, Phys. Rev. E 68 (6 Pt 2) (2003) 065103.
[44] R.M. Ewing, P. Chu, F. Elisma, H. Li, P. Taylor, S. Climie, L. McBroom-Cerajewski, M.D. Robinson, L. O’Connor, M. Li, Large-scale mapping of human protein-protein interactions by mass spectrometry, Mol. Syst. Biol. 3 (1) (2014) 89–89.
[45] H. Shen, X. Cheng, K. Cai, M.B. Hu, Detect overlapping and hierarchical community structure in networks, Physica A 388 (8) (2009) 1706–1712.
[46] A. Lancichinetti, S. Fortunato, F. Radicchi, Benchmark graphs for testing community detection algorithms, Phys. Rev. E 78 (4 Pt 2) (2008) 046110.
[47] A. Strehl, J. Ghosh, Cluster Ensembles - A Knowledge Reuse Framework for Combining Multiple Partitions, JMLR.org, 2003.
[48] W.M. Rand, Objective criteria for the evaluation of clustering methods, Publ. Amer. Stat. Assoc. 66 (336) (1971) 846–850.
[49] Y. Zhao, G. Karypis, Criterion functions for document clustering experiments and analysis, in: Machine Learning, 2002, pp. 311–331.
[50] A. Lancichinetti, S. Fortunato, J. Kertész, Detecting the overlapping and hierarchical community structure of complex networks, New J. Phys. 11 (3) (2009) 19–44.
[51] F. Mosteller, Remarks on the method of paired comparisons: I. The least squares solution assuming equal standard deviations and equal correlations, Psychometrika 16 (1) (1951) 3–9.