InfMatch: Finding isomorphism subgraph on a big target graph based on the importance of vertex


Tinghuai Ma a,∗, Siyang Yu a, Jie Cao a, Yuan Tian b, Mznah Al-Rodhaan c

a School of Computer & Software, Nanjing University of Information Science & Technology, Nanjing 210044, Jiangsu, China
b Nanjing Institute of Technology, Nanjing 210044, Jiangsu, China
c College of Computer and Information Science, King Saud University, Saudi Arabia
(∗ Corresponding author: T. Ma)

Highlights
• Propose a method based on the influence value of the vertices to determine the matching order.
• Propose an extending method based on the central node for generating sub-areas.
• Propose a subgraph matching algorithm that finds the candidate nodes on sub-areas instead of on the whole target graph.

Article history: Received 15 February 2019; Received in revised form 16 April 2019; Available online 29 April 2019
Keywords: Influential node; Subgraph matching; Kshell

Abstract: Subgraph matching is an important research topic in graph theory and is applied in many areas nowadays. Filtering and verification are the two main phases of subgraph matching algorithms. However, many invalid nodes remain in the candidate matching sets after the candidate set of each query node is initialized, which causes a large amount of redundant computation during the filtering phase. To address this problem, we propose a subgraph matching algorithm based on node influence, denoted InfMatch, to improve the performance of subgraph matching on a large target graph. Specifically, we find the central node of the query graph by calculating the global and local influence value of each query node, after which the candidate matching nodes of every query node are searched only in the neighborhood regions of the candidate nodes of the central node. Since the central node we choose is tightly connected with the other query nodes, isolated nodes cannot be added to the candidate matching set of the central node, and a number of unqualified candidate vertices are therefore pruned. To further prune unqualified candidate nodes, we propose several filtering strategies tailored to the characteristics of our method. Moreover, taking the edge constraints into account, we improve the matching order selection strategy. Extensive experiments demonstrate that our method is more efficient. © 2019 Elsevier B.V. All rights reserved.

1. Introduction

With the development of the Internet and computer science over the past few decades, combinatorial structures, especially graphs, have attracted wide attention. As a data structure that can represent complex relationships between data, the graph's strong expressiveness, simplicity and deep theoretical background make it one of the most useful modeling tools. A graph is a data structure, consisting of a set of vertices and edges, that represents the relationships between entities [1].


By analyzing the graph structure, problems such as network optimization, subgraph matching and path optimization can be solved. Graph theory began with recreational mathematics (number games), but it has evolved into an important area of mathematical research and is used in many fields such as chemistry, biology, the social sciences, and computer science. Subgraph matching is an important algorithm that is widely applied in graph analysis [2]. The aim of subgraph matching is to find subgraphs of the target graph that are similar to the query graph [3]. Subgraph isomorphism and inexact graph matching are two vital approaches to the graph matching problem; the graph isomorphism problem has strict constraints [4], requiring identical structure. Two main applications of subgraph matching are introduced below.

Example 1 (Finding the Function of a PPI). If two protein–protein interaction (PPI) networks have a similar structure, the functions of the two parts may be the same. If there is a subgraph of the target graph whose structure is similar to a given query graph, we can infer that there is a rather high possibility that the function of this subgraph is the same as that of the query graph [5,6]. Therefore, the function of a PPI can be inferred by finding a PPI with a similar structure whose function is known.

Example 2 (Group Finding). In social networks, it is sometimes necessary to find a group (such as a criminal gang) by using the partial information we already know about this group. Information known on a social network (or another network) can be converted into a query graph, and the group can be found by finding the isomorphic subgraphs of the query graph on the social network. That is to say, by finding a subgraph containing partial information about a group in the target graph, we can easily find the group we want. As shown in Fig. 1, supposing that we want to find Group 1 and we already know part of its structure (v1, v2, v3, v4 and the relationships between them), the same subgraph can be found (represented by shadowed nodes) by a subgraph matching algorithm. By expanding this part, we can obtain all the information of Group 1. This can be used to find criminal gangs: members of the same criminal gang must have close connections, so after converting the information about relationships among convicts into a small network, we can use graph matching to find the subgraph contained in the criminal gang [7].

According to the accuracy of the subgraph matching algorithm, the problem can be divided into two categories: the exact subgraph matching problem, also called the subgraph isomorphism problem, and the approximate subgraph matching problem [8]. The exact subgraph matching problem has strict structural constraints; its purpose is to find exactly the same structure as the query graph. By contrast, approximate graph matching methods intend to find subgraphs of the target graph that are similar to the query graph, which means that the found subgraph does not need to be identical to the query graph structure, that is, some differences between the query graph and the found subgraph are allowed. In this paper, we mainly focus on subgraph isomorphism on a large target graph. "Filtering-Verification" is the framework mainly used by subgraph matching algorithms. During the filtering phase, candidate matching vertices for each query node are found, after which some of the unnecessary candidate nodes are pruned out according to corresponding pruning strategies.
After that, all the isomorphic subgraphs are found by searching the candidate matching sets. During the filtering phase, the candidate sets are initialized first, at which point many isolated nodes may be added to the candidate sets, resulting in a large amount of redundant computation during filtering. To solve the problems mentioned above, we find the central node of the query graph by calculating the influence of the query nodes, which reduces the effect of isolated nodes. Furthermore, by exploring the k-neighborhood sub-area of each candidate node of the central node, we can find the candidate matching set of each query node within each candidate sub-area. During the exploring phase, we propose four filtering strategies to further prune unnecessary nodes from the candidate matching sets.

The rest of this paper is organized as follows. Section 2 introduces related work on isomorphic graph matching algorithms. The framework of our proposed algorithm and some terms are defined in Section 3, and we introduce the details of our proposed method in Section 4. Following that, the experiments are presented in Section 5. Finally, the conclusion and future work are given.

2. Related work

Subgraph matching is a typical field of graph analysis that was first studied in 1976 by Ullmann [9]. Without any complex operation, DFS (Depth First Search) traversal is used in [9] to find the isomorphic graphs; that is to say, the Ullmann algorithm finds all the isomorphic subgraphs by enumeration. Many methods have been proposed since then to improve the performance of subgraph matching. "Filtering-Verification" is the common framework adopted by most exact subgraph matching methods. The filtering phase is divided into two parts. First, most algorithms build the candidate matching set of each query node by finding nodes of the target graph with the same label after applying some basic judgment rules. After that, different pruning strategies are adopted to prune invalid nodes from each candidate set. Cordella et al. [10] first proposed the concept of filtering in 2004. The VF2 algorithm they proposed prunes nodes that do not meet a rule based on node degree: it requires that the degree of a target graph node be no less than that of the query node it matches. Since only the number of neighbor nodes is considered during filtering, VF2 cannot effectively filter the invalid nodes in candidate matching sets, and as the size of the graph increases, it cannot obtain results in a reasonable time.


Fig. 1. Group finding.

Therefore, many algorithms have since been proposed to improve the performance of filtering by constructing indexes. A "k-neighborhood signature" is defined for each node by SPath [11] to build an index for every node in the target graph and improve its pruning ability. In detail, the algorithm uses NS(u) to store the labels of the k-hop neighbors of u, and it judges whether the labels of a node's k-hop neighbors meet the requirement to decide whether the node should be pruned: if NS(v) ⊆ NS(u), node u is likely to be a candidate node for node v. TurboISO [12] adopts two pruning strategies: (1) if u is a candidate matching node of v, then the number of neighbors of u must be no less than the number of neighbors of v; (2) if u is a candidate matching node of v, the label set of the neighbor nodes of u must contain the label set of the neighbor nodes of v. More importantly, Han et al. [12] observe that nodes sharing the same neighbors in the query graph can be represented by a single vertex, so TurboISO compresses the query graph by using one node to represent multiple vertices that share the same neighbors. Later, based on the idea of graph compression, BoostIso [13] speeds up graph matching by compressing the target graph. In [5], Bonnici et al. adopt almost the same pruning strategies as VF2; their method focuses more on finding a more suitable verification order.

As for the verification phase, finding a good order for the query nodes is important. VF2 randomly chooses a node that connects with the already verified nodes; compared with the Ullmann algorithm, which determines the verification order randomly, VF2 is more efficient. SPath matches one path at a time, which means that multiple nodes are verified at a time, and the path with the smaller candidate set and shorter length is verified first. TurboISO suggests that the node with fewer candidate nodes should be matched first. Later, Bonnici et al. [5] concentrated on finding a more proper verification order; they argue that the node with the largest number of already verified neighbors should be verified first. In recent years, some methods have also been proposed to solve more complex graph matching problems. For example, Shemshadi et al. [14] proposed a method for graphs with multi-labeled nodes, graph matching for integrating multiple data sources is discussed by Zhang et al. [15], and using subgraph matching to answer natural language questions over knowledge graphs is addressed in [16,17].

There often exists more than one isomorphic subgraph in the target graph. Most existing algorithms find the candidate matching nodes of all potential isomorphic subgraphs together and verify the candidate node sets one by one. However, a large number of unqualified candidate nodes are obtained after the candidate set initialization process. For example, in Fig. 2, the initial candidate set of v3 obtained by most subgraph matching methods is {u3, u4, u7}, but it is obvious that only u3 is a valid candidate node of v3. If we can find the potential matching sub-areas before obtaining the candidate sets of the vertices, a large number of unqualified nodes can be pruned before the initialization stage of the candidate node sets. For example, in Fig. 2, assuming that we have found sub-area 1, which is the potential matching sub-area, a smaller candidate set of each query vertex is obtained, where C(v1) = {u1}, C(v2) = {u2}, C(v3) = {u3}. Redundant nodes such as u6, u4 and u7 can be pruned at the first step.
The concept of candidate sub-areas was first proposed in [12], which chooses the central node by considering the frequency of labels and information about 1-hop neighbors; information about 1-hop neighbors is also used to prune invalid candidate nodes. However, its central node selection strategy is not suitable for all kinds of target graphs, such as sparse graphs or graphs with similar label frequencies, so many unnecessary candidate nodes for the query vertices cannot be pruned in the initialization period, which results in redundant computation during filtering. Therefore, our aim is to find a more suitable start vertex that minimizes the number of initial sub-areas and thus improves the performance of subgraph matching. We solve this problem by calculating an influence value for each query node. Since both global and local structures (k-hop neighbors) are considered, the node with the largest influence value is tightly connected with the other query nodes, and the approach is suitable for all kinds of target graphs. What is more, isolated nodes cannot be added to the candidate matching sets. In summary, we make the following contributions:

(1) The influence of each node is computed by combining its global and local structural characteristics to determine the central node. The candidate matching nodes of each query node are found in the neighborhoods of the candidate matching nodes of the central node.


Fig. 2. An example to explain the shortage of most existing algorithms.

Fig. 3. An example of subgraph isomorphism.

Fig. 4. The difference between DFS and BFS.

Since node influence considers the global and local structural information of the node, the central node selected in this paper is more closely connected with the other query nodes. Thus, some isolated or invalid nodes can be eliminated when the candidate sets are initialized, which greatly improves the efficiency of the filtering phase. According to the experiments, the proposed method obtains a smaller set of initial candidates. What is more, new pruning strategies are proposed to further prune unnecessary candidate nodes and sub-areas.

(2) Considering the node influence values and the connections among nodes, our proposed strategy for determining the matching order of the query graph makes the graph search more efficient, as shown in the experiments.

3. Preliminaries

3.1. Problem definition

Undirected graphs with vertex labels are discussed in this paper. To avoid ambiguity, there is at most one edge between two nodes. A graph is a triple G(V, E, L), where V represents the nodes of the graph, E ⊆ V × V represents the edges, and L represents the node labels. |V_G| denotes the number of vertices of graph G and |E_G| the number of edges. The neighborhood set of a vertex v is N(v) = {i | i ∈ V_G, (i, v) ∈ E_G}. In the following sections, vertices of the query graph are denoted by v while vertices of the target graph are denoted by u. The main topic of this paper is exact subgraph matching, i.e., the subgraph isomorphism problem. Its formal definition is as follows.

Graph isomorphism. Given a target graph G and a query graph Q, the subgraph isomorphism problem is to find the same structure as Q in G, that is, an injective mapping f : V_Q → V_G such that ∀v ∈ V_Q, l_v = l_f(v), and ∀(v, v′) ∈ E_Q, (f(v), f(v′)) ∈ E_G, where V_Q denotes the nodes of Q, V_G the nodes of G, and l_v = l_f(v) means that the label of v equals the label of f(v). For example, in Fig. 3(b), there are two subgraphs that are isomorphic to the query graph in Fig. 3(a): {v1-u1, v2-u2, v3-u3, v4-u4} and {v1-u5, v2-u6, v3-u7, v4-u8}.
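To make the definition concrete, the following minimal Python sketch (not the authors' code; graph representation and names are our own assumptions) checks whether a given mapping f is a valid embedding under the conditions above: injectivity, label preservation and edge preservation.

```python
def is_embedding(query_adj, query_labels, target_adj, target_labels, f):
    """query_adj/target_adj: {node: set(neighbors)}; f: {query node: target node}."""
    # the mapping must be injective (one-to-one)
    if len(set(f.values())) != len(f):
        return False
    # label preservation: l_v == l_f(v)
    for v, u in f.items():
        if query_labels[v] != target_labels[u]:
            return False
    # edge preservation: (v, v') in E_Q  =>  (f(v), f(v')) in E_G
    for v, neighbors in query_adj.items():
        for v2 in neighbors:
            if f[v2] not in target_adj[f[v]]:
                return False
    return True

# e.g. for Fig. 3, f = {'v1': 'u1', 'v2': 'u2', 'v3': 'u3', 'v4': 'u4'}
```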


To find all the subgraphs of the target graph that are isomorphic to the query graph, we need to find the sub-areas that may contain the isomorphic subgraphs. Sub-areas (SA) and the other terms used in this paper are introduced as follows.

Sub-area (SA). Given a target graph G(u) = (V_G, E_G, L_G) and a query graph Q(v) = (V_Q, E_Q, L_Q), a sub-area SA(v) = (V′, E′, L′) is a subgraph of G with V′ ⊆ V_G, E′ ⊆ E_G, L′ = L_Q. A sub-area is a sub-part of the target graph that contains, and only contains, nodes with labels in L_Q.

Candidate matching set. Given a target graph G(u) = (V_G, E_G, L_G) and a query graph Q(v) = (V_Q, E_Q, L_Q), the candidate matching set of a query node v is defined as C(v) = {u | u ∈ V_G, l_u = l_v}, where l_u is the label of u.

Isolated node. Given a target graph G(u) = (V_G, E_G, L_G) and a query graph Q(v) = (V_Q, E_Q, L_Q), the set of isolated nodes is INode = {u | u ∈ V_G, l_u ∈ L_Q, and ∀u′ ∈ neighborhood(u), l_u′ ∉ L_Q}. An isolated node is a node of the target graph whose label is in L_Q but none of whose neighbors has a label in L_Q.

Isolated node set. An isolated node set contains several connected nodes whose labels are in L_Q, but either the number of these connected nodes is less than the number of query nodes, or the number of connected nodes is no less than |V_Q| while the number of distinct labels among them is less than |L_Q|.

Verification order. Given a query graph Q(v) = (V_Q, E_Q, L_Q), the verification order is a sequence that determines the order in which the query nodes are verified. It can be defined as QR = {v_t, ..., v_q}, where QR is an ordered sequence and |QR| = |V_Q|. Every query node can be added to QR only once.

3.2. Framework

In this paper, we propose a subgraph matching method that finds the isomorphic graphs in each sub-area of the target graph based on the importance of vertices. In our work, we find the central node of the query graph by computing the importance of its nodes; the node with the largest influence value is the most important query node. After that, the candidate matching set of each query node is found in each sub-area where an isomorphic subgraph may exist. We find the sub-areas by extending the candidate nodes of the central query node. According to our strategy, if there is a subgraph of the target graph that is isomorphic to the query graph, this subgraph must lie within one of these sub-areas. That is to say, we only need to search for the isomorphic subgraphs in these sub-areas, which greatly reduces redundant computation. What is more, the effect of isolated nodes is avoided since we can find a more suitable central node. More concretely, we first find the most influential node of the query graph by computing the combined influence value of each query node v, and then find the sub-areas of the target graph by extending the candidate nodes of the central query node found in the last step. For each target node u whose label is the same as the label of the central node, only if it satisfies the corresponding filtering rules can it be added to the candidate set of the central node. During the extending phase, we propose several useful pruning rules to further filter unnecessary nodes from the candidate matching sets. Because the sub-areas are related to the query graph, all of the embeddings can be found in these sub-parts instead of in the large target graph. The main framework of our algorithm is shown in Algorithm 1.

Algorithm 1 InfMatch
1: Input: Target graph G(u) = (V_G, E_G, L_G) and query graph Q(v) = (V_Q, E_Q, L_Q)
2: Output: The set of subgraphs of G that are isomorphic to Q
3: # Selecting the central node
4: Compute the influence value of each query node v
5: Select the v with the largest influence value as the central node
6: # Finding the candidate matching set for each node in each sub-area
7: Get the spanning tree of the query graph
8: Find the candidate matching nodes of the central node in the data graph; a node u can be added to the candidate set of the central node only if the filtering principles are satisfied
9: Explore the sub-areas from each candidate of the central node and find the candidate matching sets
10: Filter invalid sub-areas
11: # Verification
12: Find the matching order of the query graph
13: Find isomorphic subgraphs in each sub-area

4. InfMatch

4.1. Finding the central node of the query graph

To locate the sub-areas and minimize the redundant calculations caused by isolated nodes, we want to find a node of the query graph that is closely related to the other nodes. We can then find the sub-areas effectively by extending the candidate nodes of this node in the target graph.


We can achieve the aims mentioned above by calculating the influence of the query nodes. In social network analysis, the influence or importance of a node is also called centrality [18]. Centrality expresses the importance of a node through the significance of its connections with other nodes, so it is often used to identify the most influential node. Much previous work has addressed the ranking of node importance in complex networks. Global and local methods are the two main kinds of solutions for computing node influence. The local influence reflects the structural characteristics of the node and its k-hop neighborhood, while the global influence reflects the structural information of the node relative to the entire graph. There are many methods to compute the global influence of each node: degree centrality [19], closeness centrality [20] and betweenness centrality [21] were proposed in succession. Among the global methods, Kshell [21] is better in terms of both effect and complexity. As for the local methods, degree centrality is usually considered. Chen et al. [22] proposed a method that considers both the nearest and the next-nearest neighbors, and Sun et al. [23] proposed a method that considers the α-degree neighborhood; however, these two methods assume that each neighborhood node has the same effect on the central node. Compared with other methods, the local influence algorithm proposed by Wang et al. [24] considers multi-hop neighbors of the node and assigns a different weight to each edge, so that the contribution of different neighbor nodes to the influence value of the node is quantified. According to [21,24], the method proposed in [24] is better at distinguishing the spreading ability of nodes. The influence of a node is related not only to the attributes of the node itself, but also to its global environment, such as the network topology. Therefore, combining a node's local attribute metrics with its global information makes it possible to mine important nodes in networks better and more comprehensively. As mentioned above, in this paper we adopt the local influence computation proposed by Wang et al. [24]; for the global information, we apply the Kshell algorithm [21].

4.1.1. Global influence

KShell is a coarse ranking of node importance based on node degree. It continuously removes all nodes with degree k from the network and finally obtains the network's hierarchical structure. The higher the influence of a node, the closer it is to the other nodes in the network. The Kshell value of node v is denoted ks(v). The detailed steps of the Kshell algorithm are as follows. (1) All nodes with degree k = 1 in the entire network are removed, together with their incident edges. (2) If new nodes with degree k = 1 appear in the network after step (1), repeat step (1) until there is no node with degree k = 1 left; all nodes removed so far get KShell value 1. (3) In the remaining network, repeat the above operation to find the node set with ks = 2, that is, remove the nodes whose degree is at most 2. (4) Repeat the above steps until there is no node left in the network.
The network ends up divided into different KShell layers, each of which has its own KShell value, and the degree of the nodes in each layer satisfies k ⩾ ks. Due to the computational characteristics of Kshell, multiple nodes of the query graph may share the same Kshell value, which is not conducive to finding important nodes. Therefore, after the Kshell values are obtained, this problem is avoided by calculating an entropy value for each vertex with respect to the Kshell values. The calculation of the entropy is shown in Eq. (1), where X_i = {1, 2, 3, ..., ks_max} ranges over the Kshell values of the neighbors of node i, p_i(x_j) is the probability that a neighbor of node i lies in the j-shell layer, |x_j| is the number of such nodes in the j-shell layer, and p_i(x_j) = |x_j| / Σ_{j=1}^{ks_max} |x_j|.

Inf_global(u) = E_i(X_i) = − Σ_{j=1}^{ks_max} p_i(x_j) · log2 p_i(x_j)    (1)
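As a concrete illustration (a sketch under our own assumptions, not the authors' code), the following Python snippet computes Kshell values by iterative peeling and then the neighbor-shell entropy of Eq. (1); graphs are plain adjacency dicts.

```python
import math
from collections import Counter

def k_shell(adj):
    """Iteratively peel nodes of degree <= k for k = 1, 2, ...; returns {node: ks}.
    adj: {node: set(neighbors)} for an undirected graph."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}   # work on a copy
    ks, k = {}, 1
    while adj:
        while True:
            peel = [v for v, nbrs in adj.items() if len(nbrs) <= k]
            if not peel:
                break
            for v in peel:
                ks[v] = k
                for w in adj[v]:
                    if w in adj:
                        adj[w].discard(v)
                del adj[v]
        k += 1
    return ks

def global_influence(adj, ks, v):
    """Entropy of the Kshell values of v's neighbors, as in Eq. (1).
    Assumes v has at least one neighbor."""
    shells = Counter(ks[w] for w in adj[v])
    total = sum(shells.values())
    return -sum((c / total) * math.log2(c / total) for c in shells.values())
```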

4.1.2. Local influence

Many methods try to compute node influence from the attributes of the node itself. By comparison, we apply the method proposed in [24]. The algorithm mainly considers the centrality of each node and the importance of edge propagation. It holds that the importance of a node depends not only on its own central position, but also on the importance of its m-hop neighborhood nodes. At the same time, considering the importance of the links, each neighborhood node has a different impact on the node. The importance of each edge is calculated as in Eq. (2), where k_i and k_j are the degrees of v_i and v_j respectively, and α is a parameter that we set to 1 according to [25].

w_ij = (k_i · k_j)^α    (2)

The local influence is then computed as in Eq. (3), where ⟨w⟩ is the average of the w_ij and ϕ_u is the centrality of vertex u, taken to be the degree of u in this paper.

Inf_local(u) = ϕ_u + Σ_{j∈τ(u)} (w_uj · ϕ_j) / ⟨w⟩    (3)
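A minimal sketch of Eq. (3), again under our own assumptions: ϕ is the degree, ⟨w⟩ averages the edge weights of Eq. (2) over the whole graph, and τ(u) is simplified here to the 1-hop neighborhood (the paper allows multi-hop neighbors).

```python
def local_influence(adj, u, alpha=1.0):
    """Weighted-neighborhood centrality in the spirit of Eqs. (2)-(3).
    adj: {node: set(neighbors)}; node ids are assumed comparable."""
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    # each undirected edge counted once when averaging the weights
    weights = [(deg[a] * deg[b]) ** alpha
               for a in adj for b in adj[a] if a < b]
    w_avg = sum(weights) / len(weights)
    return deg[u] + sum((deg[u] * deg[j]) ** alpha * deg[j] / w_avg
                        for j in adj[u])
```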


4.1.3. Node influence

In order to find the node that is most closely connected with the other nodes of the network, we calculate the influence of nodes by combining the structural information of the network with the characteristics of the node itself. The formula is defined as follows, where Inf_local(v) is the local influence value of query node v and Inf_global(v) is the global influence value of v, computed from the Kshell values of the query graph as in Eq. (1); moreover, α + β = 1.

Inf_Combine(v) = α · Inf_local(v) + β · Inf_global(v)    (4)
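Combining the two helpers sketched above gives Eq. (4) directly; the default α = 0.4 below reflects the tuning result reported in Section 5.2 and is otherwise an assumption of this sketch.

```python
def combined_influence(adj, ks, v, alpha=0.4):
    """Eq. (4): convex combination of local and global influence (beta = 1 - alpha).
    Uses local_influence() and global_influence() from the sketches above."""
    beta = 1.0 - alpha
    return alpha * local_influence(adj, v) + beta * global_influence(adj, ks, v)
```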

Extensive experiments in Section 5.2 show that our method of combining the local and global node influence values for choosing the start node is more effective.

4.2. Finding the candidate matching set for each node in each sub-area

4.2.1. Extending the candidate sub-areas and initializing candidate matching sets

From Section 4.1.3 we know that the node with the largest combined influence value in the query graph is the most closely related to the other query nodes. In order to reduce unnecessary calculation, our goal is to eliminate as many unnecessary nodes as possible. Selecting a node that is closely related to the other nodes as the central node to locate sub-areas makes it possible to eliminate as many unnecessary nodes as possible in the initial phase of building the candidate matching sets. At the same time, due to this choice of central node, an area can be regarded as a candidate matching area only if it matches the complex structure around the central node, which also reduces the impact of isolated nodes and isolated node sets. As described above, we choose the node of the query graph with the largest influence value as the central node. Once the central node is found, the next step is to determine its candidate matching set. A node u of the target graph whose label equals the central node's label is a candidate matching node of the central node if it satisfies filtering rules 1 and 2, which are introduced in the next subsection. Multiple candidate sub-areas can then be located from the candidate matching nodes of the central node: each sub-area is extended from one candidate node, with length k as the expansion radius. Since our goal is to keep the number of candidate sub-areas as small as possible, we want the radius k to be as short as possible; k is determined by the longest distance from the central node to the other query nodes. Therefore, we use BFS (Breadth First Search) to build a spanning tree of the query graph, and k equals the depth of this tree. As shown in Fig. 4, BFS first traverses all the neighborhood nodes of the starting vertex, then traverses the one-hop neighbors of the vertices found in the last step according to a certain priority, and so on; finally, the BFS spanning tree is obtained. The TreeLevel of the root node (the central node) is 0, the TreeLevel of the nodes traversed next is 1, and so on. The expansion radius is the maximum TreeLevel(v) of the spanning tree. TreeLevel is defined as follows: given a query graph Q(V, E, L), the level of each node in the spanning tree produced by BFS is TreeLevel(v) = TreeLevel(parent(v)) + 1, where parent(v) is the parent of v in the spanning tree. For example, in Fig. 4, TreeLevel(v3) = 0, the level of nodes v1, v2, v4 is 1, and TreeLevel(v5) = 2. Fig. 4(b) is the depth-first search result for Fig. 4(a) and Fig. 4(c) is the breadth-first search result: the maximum TreeLevel(v) of the spanning tree under a depth-first traversal is 3, while the maximum TreeLevel(v) under breadth-first search is 2. In other words, when v3 is used as the central node to find a sub-area, it only needs to examine the 1-hop and 2-hop neighborhood nodes of v3 according to the BFS result, whereas according to the depth-first search result we would need to examine the 1-hop, 2-hop and 3-hop neighborhood nodes of v3.
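The expansion radius is simply the depth of the BFS tree rooted at the central node. A minimal sketch (our own code, assuming a connected query graph stored as an adjacency dict):

```python
from collections import deque

def expansion_radius(query_adj, central):
    """Depth of the BFS spanning tree rooted at the central node,
    i.e. the largest TreeLevel; this is the radius k used to grow sub-areas."""
    level = {central: 0}
    queue = deque([central])
    while queue:
        v = queue.popleft()
        for w in query_adj[v]:
            if w not in level:            # first visit fixes TreeLevel(w)
                level[w] = level[v] + 1
                queue.append(w)
    return max(level.values())

# For Fig. 4, expansion_radius(query, 'v3') gives 2 with BFS,
# whereas a DFS tree rooted at v3 can reach depth 3.
```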
Considering the efficiency of our algorithm, the smaller the radius is, the less computation is needed. We adopt breadth-first traversal instead of depth-first traversal of the query graph, since the depth of the BFS tree is never more than that of the DFS (Depth First Search) tree. Knowing the expansion radius, we can obtain the candidate sub-areas by continuously expanding to the multi-hop neighbors of the seed nodes using BFS. We define a list NL to save the extended nodes, NL = {(u, level(u)) | u ∈ V_G}, where the meaning of level(u) is explained below. While updating NL, we also find the candidate matching nodes of each query node. An extended node u is a candidate vertex of v if it meets filtering rules 1, 2 and 3, and only if the node is a candidate vertex of some query node can it be added to NL. We define level(u) for a target graph node u to determine the expansion order of the nodes in NL: level(u) = level(u.parent) + 1, where u.parent is the neighbor of u from which u was reached. The level of the seed node is 0, so the 1-hop neighborhood nodes of the seed are expanded first, and the 1-hop neighborhood nodes added to NL get level value 1. Each time, we expand the 1-hop neighborhood of the node with the smallest level value in NL, and we repeat this process until all nodes in NL at level k − 1 have been extended. For example, in Fig. 5, assume the central node of the query graph is v1 and u1 is a candidate node of v1. We can see in the list that level(u1) is 0. After that, the neighborhood nodes of u1 whose labels correspond to the query graph are added to NL with level value 1. Repeating this process, we finally obtain the NL list shown in Table 1. The procedure for finding the sub-areas and candidate matching sets is shown as Algorithm 2; filtering rules 1–5 mentioned in Algorithm 2 are introduced in Section 4.2.2.


Fig. 5. An example of the NL computation.

Fig. 6. The number of candidate regions affected by α.

Table 1. The NL list of Fig. 5.

u     Level(u)
u1    0
u2    1
u3    1
u4    2


Algorithm 2 Finding all sub-areas and candidate sets
1: Input: The expansion radius k
2: Output: All of the sub-areas and the candidate matching set of each query node in each sub-area
3: # Finding the start node of each sub-area
4: Find the candidate set CS(v) of the central node
5: # Extending a sub-area for each vertex in CS(v)
6: Add the central node candidate c to the node queue
7: Update NL(c, level(c))
8: Get the node to be extended, u = queue.remove()
9: Add u to SA
10: For each neighborhood node u′ of u, if u′ meets pruning strategies 1–3, add u′ to the queue and to the candidate matching set of the query node v (l_v = l_u′)
11: Repeat steps 8–10 until there is no node in the queue
12: # Judging whether the sub-area found can be a candidate sub-area
13: Judge whether it meets the isJoinable principle
14: # Repeat to find all the sub-areas
15: Repeat steps 6 to 13 to find all of the sub-areas
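The core of Algorithm 2 is a bounded BFS from each seed. The sketch below is our own simplification: candidates are grouped by label (the paper keeps one set per query node), and the filtering rules 1–3 are represented by a caller-supplied predicate.

```python
from collections import deque

def expand_sub_area(target_adj, target_labels, query_labels, seed, k, passes_filters):
    """Grow one candidate sub-area from a seed (a candidate of the central node)
    out to radius k. passes_filters(u, level) stands in for filtering rules 1-3."""
    level = {seed: 0}
    candidates = {}                       # label -> set of target nodes
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        candidates.setdefault(target_labels[u], set()).add(u)
        if level[u] == k:                 # nodes at radius k are not expanded further
            continue
        for w in target_adj[u]:
            if w in level or target_labels[w] not in query_labels:
                continue                  # only labels of the query graph are kept
            if passes_filters(w, level[u] + 1):
                level[w] = level[u] + 1
                queue.append(w)
    return candidates

# e.g. expand_sub_area(g, g_labels, set(q_labels.values()), 'u1', 2,
#                      lambda u, lvl: True)   # trivial filter for illustration
```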

4.2.2. Filtering strategies

In Section 4.2.1 we mentioned that the candidate matching set of the central node needs to be filtered. In fact, while expanding the candidate sub-areas, it is necessary to judge whether an extended node can be added to the candidate matching set of the corresponding query node. In addition, if a sub-area does not satisfy the relevant filtering rules during the expansion process, it can be filtered out in advance. Since not all of the obtained candidate sub-areas contain subgraphs isomorphic to the query graph, after the candidate sub-areas are obtained, those that violate any rule are also filtered out. In summary, there are two types of filtering rules: one kind applies to candidate matching nodes and sub-areas while the candidate sub-areas are being expanded, and the other applies to entire sub-areas after multiple candidate sub-areas have been obtained. Filtering strategies 1–3 introduced below are of the first kind, and strategies 4 and 5 are of the second kind. To solve the problem that the basic pruning strategy, which considers only the number and labels of 1-hop neighbors, cannot prune candidate nodes effectively, we compute the query-related structural feature value of candidate nodes within the m-hop neighborhood according to Eq. (5), where L(i), L(j) ∈ L_Q.

Inf_query-related(u) = ϕ_u + Σ_{j∈τ(u)} w_uj · ϕ_j,  where w_uj = (k_u · k_j)^α    (5)
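A minimal sketch of Eq. (5) under our reading of "query-related": degrees and weights are taken on the subgraph induced by target nodes whose label occurs in L_Q (names and representation are assumptions, not the authors' code).

```python
def query_related_influence(adj, labels, query_labels, u, alpha=1.0):
    """Eq. (5) restricted to nodes carrying a query label; u is assumed
    to carry a query label itself (it is a candidate node)."""
    keep = {v for v in adj if labels[v] in query_labels}
    deg = {v: sum(1 for w in adj[v] if w in keep) for v in keep}
    return deg[u] + sum((deg[u] * deg[j]) ** alpha * deg[j]
                        for j in adj[u] if j in keep)
```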

The difference between Eqs. (3) and (5) is that Eq. (5) is query-related: when computing the local structural feature value of a target node, we only consider the nodes whose label is the same as that of one of the query nodes. The reason why we do not use the combined influence here is that the target graph is much larger than the query graph; what we care about is the local feature of a target graph node within its small range rather than within the whole graph, so considering the global influence during filtering would be time-consuming and useless. In summary, in the sub-area expansion process there are three main pruning rules, where strategy 1 is adopted by all subgraph matching algorithms and strategies 2 and 3 are proposed based on the characteristics of the InfMatch algorithm.

Filtering rule 1. Given a query graph Q(V_Q, E_Q, L_Q) and a target graph G(V_G, E_G, L_G), a vertex u ∈ V_G cannot be the candidate matching vertex of a vertex v ∈ V_Q if |N(u)| < |N(v)|, where |N(v)| and |N(u)| are the numbers of neighbors of v and u respectively.

Filtering rule 2. Given a query graph Q(V_Q, E_Q, L_Q) and a target graph G(V_G, E_G, L_G), a vertex u ∈ V_G cannot be the candidate matching vertex of a vertex v ∈ V_Q if Inf_query-related(u) < Inf_query-related(v). According to Eq. (5), Inf_query-related(u) represents the query-related structural characteristics around u. That is to say, if Inf_query-related(u) < Inf_query-related(v), u cannot be a candidate node of v, because the structure around u is not as complex as the structure around v.

Filtering rule 3. Given a query graph Q(V_Q, E_Q, L_Q) and a target graph G(V_G, E_G, L_G), a vertex u ∈ V_G cannot be the candidate matching vertex of a vertex v ∈ V_Q if Level(u) > TreeLevel(v). As a node of the target graph, the network structure around u should be more complicated than the network structure around the corresponding query node v; therefore, if u is a candidate matching node of v, the BFS level of u must not be greater than that of v. In fact, filtering rule 3 can also be used to prune invalid sub-areas during the expansion process. If the level of all nodes still to be expanded is greater than that of a query node v whose candidate matching set is empty (the Level(v) layer of the candidate sub-area has been expanded and the Level(v) + 1 layer is about to be expanded), this area cannot contain the same structure as the query graph.


In this case, the sub-area can be deleted without further expansion. This is because, when the level of all nodes to be expanded is greater than Level(v), according to filtering rule 3 no further node can be added to C(v) in the remaining expansion process. After the expansion process is completed, multiple candidate matching sub-areas are obtained, and these sub-areas need further filtering. Therefore, we propose filtering strategies 4 and 5 to prune invalid sub-areas further.

Filtering rule 4. Given a query graph Q(V_Q, E_Q, L_Q) and a sub-area SA(V′, E′, L′), this sub-area does not contain any subgraph isomorphic to the query graph if ∃v ∈ V_Q with |C(v)| = 0, where |C(v)| is the number of candidate nodes of v.

Filtering rule 5. Given a query graph Q(V_Q, E_Q, L_Q) and a sub-area SA(V′, E′, L′), if there exist distinct v1, v2 ∈ {v | |C(v)| = 1} with C(v1) = C(v2), where C(v) is the candidate node set of v, this sub-area does not contain any subgraph isomorphic to the query graph. That is to say, if the candidate sets of two query nodes in a sub-area both have size one and their candidate matching nodes are the same, one of the query nodes cannot find a one-to-one matching node, whereas the mapping between query nodes and target graph nodes must be one-to-one according to the definition of subgraph isomorphism.

4.3. Verification order

The matching order greatly affects the matching efficiency. A good matching order avoids many redundant calculations by filtering out nodes that do not meet the conditions as early as possible. Considering the matching principles, a candidate matching node must satisfy the edge constraints: according to the matching judgment rule, for the current query vertex to be matched, if there is an edge between it and an already matched query node v_m, then there must also be an edge between its candidate matching vertex and the matching vertex M(v_m) of v_m. It can be seen that if the current vertex to be verified has more edges connecting it to the already matched vertices, nodes that do not meet the conditions can be discovered earlier, and the following branches need not be examined. That is, we want the node to be judged next to have a more complicated edge relationship with the vertices that have already been judged, so we give priority to nodes that have more complex structural relationships with the judged vertices. In order to reduce the redundant calculations described above, we consider two metrics to determine the matching order. Two lists, QR and CL, are used to compute the query order. QR is a list of query nodes whose adding order is the matching order. CL consists of the nodes that satisfy the condition to be added to QR, defined as CL = {v | v ∈ V_Q, v ∉ QR, ∃u ∈ QR, (v, u) ∈ E_Q}. In other words, the nodes that are connected with vertices in QR are added to CL.
The procedure for finding the query order is as follows: (1) the node with the greatest influence value in the query graph is selected as the start node and added to the query order list QR; (2) the nodes that are connected with the nodes in QR are added to CL; (3) the node in CL with the largest influence value is found and added to QR; (4) steps 2 and 3 are repeated until all vertices have been added to QR.

4.4. Complexity analysis

Assume that the number of distinct labels of the target graph is l. The average number of candidate vertices of the central node without any pruning is then n/l, where n is the number of vertices of the target graph. For each candidate node of the central node, we need to expand it to obtain a candidate matching sub-area, which is the process of finding the candidate nodes of each query vertex. Assuming that the query size is m, and since the neighborhood of each vertex in NL is visited during extension, the complexity of the extending period is O((n/l) · m · d), where d is the average degree of the target graph. For TurboISO, the complexity of exploring the candidate matching sets is O(l_c · m · d). When choosing the central node, TurboISO favors nodes with infrequent labels; however, this is not efficient when the frequency of each label is similar. By contrast, our method is suitable for various kinds of target graphs, since we find the central node that is most closely connected with the other nodes, and this choice is not affected by the graph structure. In practice, the complexity of our method during the exploring stage is no more than that of TurboISO because of our central node selection strategy, and the extensive experiments in Section 5.3 demonstrate this. Since the target graph is compressed in BoostIso, it is time-consuming when the size of the target graph is large; before filtering, the time complexity of compressing the target graph is O(n · N · d + |E| · N_l), where N_l is the number of nodes with label l. The filtering time complexity of RI is O((n · m)/l); however, RI mainly focuses on finding a better matching order, and many invalid nodes remain in the candidate sets at verification time. The worst-case verification time complexity of RI is O(c^m), where c is the average size of the candidate sets.
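As a concrete illustration of the matching-order procedure of Section 4.3, the following sketch (our own code, assuming a connected query graph and a precomputed influence value per node) builds QR by repeatedly picking the most influential node connected to the ordered prefix.

```python
def matching_order(query_adj, influence):
    """Order-selection of Section 4.3: start from the most influential node,
    then repeatedly add the most influential node adjacent to QR."""
    qr = [max(influence, key=influence.get)]
    while len(qr) < len(query_adj):
        cl = {v for u in qr for v in query_adj[u] if v not in qr}
        qr.append(max(cl, key=influence.get))   # assumes the query graph is connected
    return qr
```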


Table 2. Details of real-world graphs.

Dataset   Number of nodes   Number of edges   Number of distinct labels
P2P       6,301             20,800            50
Amazon    548,552           926,151           50
Yeast     3,112             12,797            50

Table 3. Details of synthetic graphs.

Dataset             Number of nodes   Number of edges   Average degree
Synthetic graph 1   10,000            46,750            5
Synthetic graph 2   10,000            95,257            10
Synthetic graph 3   10,000            150,192           15
Synthetic graph 4   10,000            197,272           20

5. Experiment

5.1. Datasets and setup

We conduct experiments on various kinds of datasets to verify the efficiency and robustness of our proposed method. All the experiments are conducted on a machine with an Intel i5 2.40 GHz CPU and 6 GB of memory. The target graphs we use are introduced as follows; the real-world graphs are obtained from [26] and their details are described in Table 2.
(1) P2P is a network dataset collected from Gnutella peer-to-peer file sharing. In this network, nodes represent hosts in the Gnutella network topology and edges represent connections between the Gnutella hosts. The network contains 6,301 nodes and 20,800 edges, and there are 50 distinct labels in this dataset.
(2) Amazon is an undirected network built on the principle that customers who purchase one commodity also bought another item. This dataset is obtained by crawling the Amazon website (www.amazon.com). The Amazon graph provided by SNAP is actually directed, but to conduct our experiment on this dataset we regard two nodes as connected by an edge if customers often bought the two items together. Therefore, the nodes represent goods on Amazon, while an edge between two nodes indicates that customers often purchase the two goods together. The network contains 548,552 nodes and 926,151 edges, and there are 50 distinct labels in this dataset.
(3) Yeast is a biological dataset that contains abundant PPI information; nodes in this dataset represent proteins, and interactions between proteins are represented by edges. It is widely used in many papers [11,13]. The dataset has 12,797 edges and 3,112 nodes. To make the comparison fair, it also contains 50 distinct node labels.
(4) Synthetic graphs. To find the best values of the variables α and β, we use synthetic graphs to conduct experiments that determine the variable values. GraphGen [27] is used to generate the synthetic graphs; GraphGen is a graph generator that can produce graphs whose features are determined by the user. We generate four networks for the experiments; each network contains 10,000 nodes, and their average degrees increase in turn. For the subsequent comparison experiments, we randomly assign a label value to each node, with 30 distinct labels. The details of the synthetic graphs are described in Table 3.

Query graphs. For each dataset, we test the performance of subgraph search using six query sets, where the number of edges per query is 4, 8, 12, 16, 20 and 24, respectively. The query graphs are subgraphs of the data graph obtained by random walk: we choose a node of the data graph and then walk to its neighbors randomly until the number of edges reaches the threshold (4, 8, 12, 16, 20, or 24).

5.2. Node influence model optimization

In Section 4.1.3, we propose a method to find the central node by combining the global and local influence of the node. In this subsection, we use the four synthetic datasets mentioned in the previous subsection to find the best α value in Eq. (4). To do so, we let α equal 0, 0.1, 0.2, 0.3, ..., 1, respectively. For each α value, we compare the number of candidate sub-areas under query graph sizes of 4, 8, 12, 16, 20 and 24. An α value is considered better when the number of candidate sub-areas it yields is smaller.
It should be noted that, in order to show that our method can filter out unnecessary areas as early as possible and thus reduce redundant calculation, the number of sub-areas we compare is the number of sub-areas obtained after filtering the candidate nodes of the central node, not the number of sub-areas that remain to be verified after all filtering. Fig. 6(a)–(d) show the number of initial sub-areas for the four synthetic datasets, respectively. The horizontal axis represents different values of α, and the vertical axis represents the number of initial sub-areas under different α values.


Fig. 7. The effectiveness of different candidate central node selection strategies.

The lines in the figure represent the number of initial sub-areas under different α values when the query size is 4, 8, 12, 16, 20 and 24, respectively. As we can see from Fig. 6, for synthetic graph 1 the result is better when α is 0.4 or 0.5, while for synthetic graphs 2, 3 and 4 the number of sub-areas is smallest when α is 0.4. Therefore, the best value of α is set to 0.4.

5.3. Effectiveness of central node selection

In Section 4.1.3, we propose a method to find the central node by combining the global and local influence of the node. The selection of the central node affects the number of initial sub-areas, and the smaller the number of initial sub-areas, the higher the efficiency of the algorithm. In this subsection, we compare our proposed central node selection method with other similar methods. It should be noted that the number of candidates in Fig. 7 refers to the number of candidate nodes obtained after the vertex filtering strategy, rather than the number of sub-areas obtained after the sub-area extension process; that is, Fig. 7 reflects the effectiveness of the central node selection. Because the comparison is based on the filtering principles, the necessary vertices are never pruned out. As can be seen from Fig. 7, when the central node is selected randomly, the number of initial sub-areas is the largest among all methods. The number of initial sub-areas is also greater when only the local influence or only the global influence of the node is calculated, compared with computing the node influence value using our strategy. Therefore, using our proposed method to select the central node is better than the other methods. Furthermore, to show that our pruning strategy for candidate central nodes is more effective, we also compare with the number of candidate nodes obtained using the strategy of TurboISO.


Fig. 8. Efficiency of different algorithms for different sizes of query graphs.

As we can see in Fig. 7, our strategy can greatly reduce the number of sub-areas compared with TurboISO. What is more, when the size of the query graph reaches 8, there is a sharp decline in the number of candidates obtained by every method; this is because when the size of the query graph is 4, its structural information is relatively simple.

5.4. Comparing the efficiency of the algorithms

In this section, we use the three real datasets introduced in Section 5.1 to conduct experiments. We compare the performance of our proposed algorithm with BoostIso [13], TurboISO [12] and RI [5]; these methods were introduced briefly in Section 2. Fig. 8(a), (b) and (c) show the verification running time of each algorithm on P2P, Yeast and Amazon, respectively. The horizontal axis represents different sizes of query graphs, and the vertical axis represents the verification running time. As can be seen from these figures, as the size of the query graph increases, the search time grows faster, especially after the size exceeds 8. Although RI reduces many redundant calculations by using a reasonable matching order during verification, its growth rate is the largest; this is because its filtering period only adopts some simple strategies, which lets some unnecessary candidate nodes enter the verification stage. Among the compared methods, the algorithm we propose grows at a slower rate. In particular, comparing with Fig. 8(b), although the P2P dataset is much smaller than Amazon, its query time is larger than that of Amazon; this is because the average number of isomorphic subgraphs per query in P2P is much larger than in Amazon. In Fig. 8(c), we can see that the search time of RI is much larger than that of the other algorithms: the size of Amazon is large and the number of distinct labels we set is small.


Therefore, due to the characteristics of the RI filtering strategy, in the verification phase the candidate set of each query node is large, which results in a large number of iterations.

6. Future work

Subgraph matching is time-consuming when the dataset is large [28]. For example, in social networks, with the development of the Internet and smart phones, more and more people use social network apps to connect with others [29]. About 829 million people use Facebook every day and the number of registered Facebook users is over one billion, which generates a vast amount of Internet-based social network data [30]. What is more, the query time increases with the scale of the query graph [31]. Therefore, it is effective to use parallel methods to speed up the graph matching procedure. The structure of our proposed method, such as the computation of the influence value of nodes, is suitable for parallelization, and our next step is to port this method to a parallel platform. Spark [32,33] is a big-data parallel computing framework based on in-memory computing that can be used to build large, low-latency data analysis applications. Spark improves and optimizes on the original MapReduce model and provides effective support for multiple types of computation [34]. Spark's speed advantage lies in its ability to run computations in memory, and it is more efficient than MapReduce even when running complex applications on disk [35,36]. Although the Hadoop platform can adopt a parallel design to achieve iterative calculations on large-scale network graph structures, its efficiency is too low. Therefore, as described above, it is worthwhile for subgraph matching methods to use Spark to parallelize and speed up the search.

7. Conclusion

In this paper, we propose a subgraph matching method that finds the isomorphic graphs in each sub-area of the target graph based on the importance of vertices. We consider the structural features around nodes by computing node influence values, which prunes unnecessary vertices early. Based on the characteristics of InfMatch, new filtering strategies are proposed to filter further. In addition, we improve the method for determining the matching order based on the influence values of the vertices. According to the experiments, our algorithm is more efficient than the other algorithms. What is more, in the filtering stage, our algorithm can filter out a large number of unreasonable nodes when initializing the candidate sets, which greatly reduces the filtering time.

Acknowledgments

This work was supported in part by the National Science Foundation of China (No. U1736105, No. 61572259) and by the National Social Science Foundation of China (No. 16ZDA054). This research project was supported by a grant from the Research Center of the Female Scientific and Medical Colleges, Deanship of Scientific Research, King Saud University, Saudi Arabia.

References

[1] T. Ma, W. Shao, Y. Hao, C. Jie, Graph classification based on graph set reconstruction and graph kernel feature reduction, Neurocomputing 296 (2018) 33–45.
[2] T. Ma, Y. Wang, M. Tang, J. Cao, Y. Tian, A. Al-Dhelaan, M. Al-Rodhaan, LED: A fast overlapping communities detection algorithm based on structural clustering, Neurocomputing 207 (2016) 488–500.
[3] K. Semertzidis, E. Pitoura, Top-k durable graph pattern queries on temporal graphs, IEEE Trans. Knowl. Data Eng. 31 (1) (2019) 181–194.
[4] Y. Lv, T. Ma, M. Tang, C. Jie, T. Yuan, A. Al-Dhelaan, M. Al-Rodhaan, An efficient and scalable density-based clustering algorithm for datasets with complex structures, Neurocomputing 171 (C) (2016) 9–22.
[5] V. Bonnici, R. Giugno, On the variable ordering in subgraph isomorphism algorithms, IEEE/ACM Trans. Comput. Biol. Bioinform. 14 (1) (2017) 193–203.
[6] Y. Shui, Y.R. Cho, Alignment of PPI networks using semantic similarity for conserved protein complex prediction, IEEE Trans. Nanobiosci. 15 (4) (2016) 380–389.
[7] D.Q. Tang, Y. Zhang, H.E. Yong-Heng, Z.H. Xiao, Research and application on crime rule based on graph data mining algorithm, Comput. Tech. Dev. 2 (2011) 337–338.
[8] Z.M. Guanfeng Li, Y. Li, Pattern match query over fuzzy RDF graph, Knowl.-Based Syst. 165 (2019) 460–473.
[9] J.R. Ullmann, An algorithm for subgraph isomorphism, J. ACM 23 (1) (1976) 31–42.
[10] L.P. Cordella, P. Foggia, C. Sansone, M. Vento, A (sub)graph isomorphism algorithm for matching large graphs, IEEE Trans. Pattern Anal. Mach. Intell. 26 (10) (2004) 1367–1372.
[11] P. Zhao, J. Han, On graph query optimization in large networks, VLDB Endowment, 2010, pp. 340–351.
[12] W.S. Han, J. Lee, J.H. Lee, TurboISO: towards ultrafast and robust subgraph isomorphism search in large graph databases, in: ACM SIGMOD International Conference on Management of Data, 2013, pp. 337–348.
[13] X. Ren, J. Wang, Exploiting vertex relationships in speeding up subgraph isomorphism over large graphs, Proc. VLDB Endow. 8 (5) (2015) 617–628.
[14] S. Ali, Q.Z. Sheng, Y. Qin, Efficient pattern matching for graphs with multi-labeled nodes, Knowl.-Based Syst. 109 (2016) 256–265.
[15] D. Zhang, B.I.P. Rubinstein, J. Gemmell, Principled graph matching algorithms for integrating multiple data sources, IEEE Trans. Knowl. Data Eng. 27 (10) (2015) 2784–2796.
[16] S. Hu, Z. Lei, J.X. Yu, H. Wang, D. Zhao, Answering natural language questions by subgraph matching over knowledge graphs, IEEE Trans. Knowl. Data Eng. 30 (5) (2018) 824–837.


[17] H. Rong, T. Ma, J. Cao, Y. Tian, A. Al-Dhelaan, M. Al-Rodhaan, Deep rolling: A novel emotion prediction model for a multi-participant communication context, Inform. Sci. 488 (2019) 158–180.
[18] T. Ma, M. Yue, J. Qu, Y. Tian, A. Al-Dhelaan, M. Al-Rodhaan, PSPLPA: Probability and similarity based parallel label propagation algorithm on spark, Physica A 503 (2018) 366–378.
[19] X. Tang, J. Wang, J. Zhong, Y. Pan, Predicting essential proteins based on weighted degree centrality, IEEE/ACM Trans. Comput. Biol. Bioinform. 11 (2) (2014) 407–418.
[20] S. Basu, U. Maulik, O. Chatterjee, Stability of consensus node orderings under imperfect network data, IEEE Trans. Comput. Soc. Syst. 3 (3) (2016) 120–131.
[21] S. Pei, L. Muchnik, J.S. Andrade Jr., Z. Zheng, H.A. Makse, Searching for superspreaders of information in real-world social media, Sci. Rep. 4 (2014) 5547.
[22] T. Bian, J. Hu, Y. Deng, Identifying influential nodes in complex networks based on AHP, Physica A 391 (4) (2012) 1777–1787.
[23] H. Sun, J. Huang, X. Zhong, K. Liu, J. Zou, Q. Song, Label propagation with α-degree neighborhood impact for network community detection, Comput. Intell. Neurosci. 2014 (2014) 130689.
[24] J. Wang, X. Hou, K. Li, Y. Ding, A novel weight neighborhood centrality algorithm for identifying influential spreaders in complex networks, Physica A 475 (2017) 88–105.
[25] B. Mirzasoleiman, M. Babaei, M. Jalili, M. Safari, Cascaded failures in weighted networks, Phys. Rev. E 84 (2) (2011) 046114.
[26] http://snap.stanford.edu/.
[27] GraphGen — a synthetic graph data generator, http://www.cse.ust.hk/graphgen.
[28] X. Chen, H. Huo, J. Huan, J.S. Vitter, Efficient graph similarity search in external memory, IEEE Access 5 (99) (2017) 4551–4560.
[29] W.F. Chen, L.W. Ku, We like, we post: A joint user-post approach for facebook post stance labeling, IEEE Trans. Knowl. Data Eng. 30 (10) (2018) 2013–2013.
[30] T. Ma, Y. Zhang, C. Jie, S. Jian, M. Tang, T. Yuan, A. Al-Dhelaan, M. Al-Rodhaan, KDVEM: a k-degree anonymity with vertex and edge modification algorithm, Computing 97 (12) (2015) 1165–1184.
[31] M.A. Pinheiro, J. Kybic, P. Fua, Geometric graph matching using Monte Carlo tree search, IEEE Trans. Pattern Anal. Mach. Intell. 39 (11) (2017) 2171–2185.
[32] A. Gounaris, J. Torres, A methodology for spark parameter tuning, Big Data Res. 11 (2018) 22–32.
[33] A. Sapountzi, K.E. Psannis, Social networking data analysis tools and challenges, Future Gener. Comput. Syst. 86 (2018) 893–913.
[34] Q. Zhang, M.F. Zhani, Y. Yang, R. Boutaba, B. Wong, PRISM: Fine-grained resource-aware scheduling for mapreduce, IEEE Trans. Cloud Comput. 3 (2) (2017) 182–194.
[35] M. Bakratsas, P. Basaras, D. Katsaros, L. Tassiulas, Hadoop mapreduce performance on SSDs for analyzing social networks, Big Data Res. 11 (2018) 1–10.
[36] B. Zhang, X. Wang, Z. Zheng, The optimization for recurring queries in big data analysis system with mapreduce, Future Gener. Comput. Syst. 87 (2018) 549–556.