Identifying similar networks using structural hierarchy


Physica A xxx (xxxx) xxx

Contents lists available at ScienceDirect

Physica A journal homepage: www.elsevier.com/locate/physa


Rakhi Saxena a, Sharanjit Kaur b, Vasudha Bhatnagar c

a Deshbandhu College, University of Delhi, India
b Acharya Narendra Dev College, University of Delhi, India
c Department of Computer Science, University of Delhi, India

Article info

Article history: Received 7 May 2018; received in revised form 3 November 2018; available online xxxx.

Keywords: Network similarity; Network comparison; Structural hierarchy; k-core decomposition; k-truss decomposition; Graph analytics; Social networks; Canberra distance; Quantiles

Abstract

Comparing structural similarities among complex networks is an important task in several scientific and social science applications. Existing techniques for quantifying network similarity range from network-centric methods that consider global network topology to node-centric methods that consider local node-level sub-structures. In this paper, we address the research gap between computationally expensive network-centric approaches and myopic node-centric network comparison methods by introducing a novel approach to quantify network similarity based on hierarchical graph decomposition. The approach adequately captures both global and local topology and is motivated by the observation that networks from diverse domains such as physical, chemical, biological and social systems exhibit an inherent structural hierarchy that emerges from local dyadic and triadic interactions. The proposed algorithm, Network Similarity via graph Decomposition (NSD), extracts network signatures from the hierarchical decomposition of networks and uses the Canberra distance to quantify the similarity between signatures. We use two well-known graph decomposition methods to expose network hierarchy, resulting in two variations of NSD. We find that our approach groups similar networks better than competing algorithms. Experimentation using 40 real-world networks, 15 massive networks, and 30 large synthetic networks establishes that the proposed methodology is effective, scalable, sensitive and applicable to a wide variety of networks. © 2019 Elsevier B.V. All rights reserved.

1. Introduction

The premise that the behaviour and lives of social entities are impacted by their respective positions in the overall social structure has catalyzed research in the field of Social Network Analysis. Interestingly, the principles of social network analysis have also been applied successfully to networks of inanimate objects, giving a fillip to the multidisciplinary field of Complex Networks. Researchers in this field have not only generalized the theoretical underpinnings of social network analysis, but have also made significant advances that extend to diverse domains including social networks, telecommunication networks, power-grid networks, the world wide web, biological networks, chemical networks, text graphs etc. [1–4].

Comparing complex systems by modelling them as networks (we use the terms network/graph, node/vertex, and edge/link interchangeably) is an effective strategy for quantifying similarity (or dissimilarity) between their structures. Re-engineering a telecommunication network, planning a road network

Please cite this article as: R. Saxena, S. Kaur and V. Bhatnagar, Identifying similar networks using structural hierarchy, Physica A (2019), https://doi.org/10.1016/j.physa.2019.04.265.


during urban development etc. requires comparison with an efficient counterpart. Comparing networks for similarity is imperative for detection of temporal changes and anomalies in web graphs or financial networks [5,6]. Code thefts can be detected by comparing control flow graphs of executable objects [7]. Pattern matching and retrieval of symbolic images are other interesting applications of network comparison [8].

With a vast array of applications in diverse domains, measuring similarity of complex networks is a crucial task in network analysis. Quantification of structural similarity is essential for discriminating between healthy and diseased cellular networks [3], analysing functionality and topology of biological networks [3,4,9], constructing phylogenetic trees from metabolic networks [9] and clustering brain connectivity graphs of people [10]. Grouping similar social networks is essential for the longitudinal study of social groups [10,11], trade flow networks [12], the scientific community [13], etc.

Similarity between two networks is a function of similarities between their orders, sizes, and topological features. Since there can be $\binom{\binom{n}{2}}{m}$ distinct graphs of the same order ($n$) and size ($m$), a deeper examination of the topological features of the compared networks is essential for assessing and quantifying similarity between them.

State-of-the-art network similarity (network comparison) algorithms summarize network topology in a 'feature vector' and quantify similarity between networks by computing the distance between corresponding vectors. These algorithms differ in two aspects: the discriminating features of the network that are extracted, and the distance metric that is used to quantify similarity. There is a plethora of choices for distance metrics because of their fundamental importance in pattern recognition and information retrieval problems [14].
It is the astute design of feature vectors that determines the effectiveness and efficacy of the network similarity algorithm.

1.1. Motivation for the study

Several existing network comparison algorithms elicit network-centric properties and suitably aggregate them as feature vectors [9,15–19]. These properties, extracted by considering node embeddings in the complete network, include density, diameter, degree distribution [15], number of spanning trees [16], average clustering coefficient, transitivity, modularity [17], eigenvalues of the graph Laplacian [9], graphlet² [3] distribution, summaries of network community structure [18], and densities of all realizations of k-node induced subgraphs [19]. It is noteworthy that these properties are coarse-grained and completely disregard local-level node embeddings that could be symptomatic of significant differences between networks. Further, the heavy computational expense incurred in extracting network-centric properties makes them unattractive for comparison of large networks.

Other network comparison algorithms use node-centric properties and aggregate them as feature vectors to capture fine-grained differences between networks [3,12,13,20,21]. Node-centric properties are computed by taking into account the embedding of nodes in their immediate neighbourhood, such as the 1-step egonet [13], 2-step egonet [20], induced subgraph centred on a node [21], and graphlets centred on a node [12]. Failure to capture the global topology of the network limits the power of these algorithms to expose latent differences between networks.

In this paper, we address the research gap between computationally expensive network-centric approaches and myopic node-centric network comparison methods. We exploit structural hierarchy, which is a network-centric property, extracted from the local embedding of nodes in dyads and triads.
Structural hierarchy has been recognized as an organizational property of many man-made systems, ranging from companies, cities and societies to road networks, autonomous systems and the Internet [22–24]. Recent works in computational biology provide ample evidence of hierarchical organization in gene regulatory, protein interaction, metabolic and ecological networks [23,25,26]. Encouraged by the importance of hierarchy, we leverage signals obtained from hierarchy levels to reveal finer structural distinctions between graphs. The approach yields the advantage of capturing local topology within the perspective of the entire network and is scalable.

1.2. Contributions

Research contributions of the paper are listed below.

(i) We establish that local node embeddings examined in a global perspective generate effective network signatures for capturing latent differences between two networks.
(ii) We propose the Network Similarity via graph Decomposition (NSD) algorithm to quantify network similarity using features derived from hierarchical graph decomposition.
(iii) We present two variations of the NSD algorithm, viz. NSD-C and NSD-T, based on two different hierarchical graph decomposition methods.
(iv) We empirically establish the superiority of the proposed variations compared to four state-of-the-art network comparison algorithms. We also demonstrate the effectiveness, scalability and sensitivity of the NSD-C and NSD-T algorithms through extensive experimentation.

We present the proposed algorithm NSD in Section 2. Two popular methods for hierarchical graph decomposition used by the proposed algorithm are described in Section 3. Section 4 details two variations of algorithm NSD based on the different methods used to decompose graphs hierarchically. Empirical investigations are presented in Section 5. A survey of network similarity algorithms is given in Section 6. Section 7 concludes the paper.

² A node's egonet is the induced subgraph of its neighbouring nodes; graphlets are sub-graphs of k nodes.

2. The proposed algorithm

Our approach is based on the conjecture that features derived from the arrangement of nodes in a network hierarchy are discriminating signatures of its overall structure. The basis of the conjecture is that hierarchical structure is the primary factor guiding social interactions [27], and can explain many commonly observed topological properties of networks [28].

Hierarchy in a network is a partition of its vertex set based on the way nodes are embedded in the network. A function $P$ assigns a hierarchy level to vertices by computing a specific property of the node derived from its embedding in the network structure. A formal definition adapted from [29] follows.

Definition 2.1 (Hierarchy). Let $G = (V, E)$ be a simple, connected, undirected graph, where $V$ represents the set of vertices and $E \subseteq V \times V$ represents the set of edges. $P$ is a function defined on a structural node property such that $P : V \to \mathbb{N}^+$. A hierarchy $H_P$ on $G$ is a partition of $V$ into $k$ subsets $\{V_1, \ldots, V_k\}$, such that $\forall v \in V_l$, $P(v) = \rho_l$, $l = 1, \ldots, k$.

Vertices that share the same property value $\rho_l$ belong to partition $V_l \subseteq V$. Without loss of generality, we may assume that $\rho_l < \rho_{l+1}$. The vertex partitions are mapped to distinct hierarchy levels in the network. Nodes in $V_1$ are considered to be at level $l_1$, which is lower than that of the nodes in $V_2$ (i.e. $l_2$), and so on.

Hierarchical decomposition of a network reveals the underlying hierarchy in the network by organizing it into levels that impart a natural ordering to the nodes in the network. We leverage hierarchy-based signals to propose a novel algorithm, Network Similarity via (hierarchical) Decomposition (NSD), to assess similarity between networks.
Given a network, the proposed algorithm (i) decomposes it hierarchically to determine the hierarchy levels of vertices, (ii) creates feature vectors from two hierarchy-based node features, namely the hierarchy-level of nodes and their hierarchy-affinity, (iii) composes a network signature by suitably aggregating the feature vectors, and (iv) quantifies similarity between networks by comparing their signatures. In the following subsections, we elaborate on the details of the NSD algorithm.

2.1. Node features extracted from network hierarchy

Interactions with neighbours at different hierarchy levels give rise to distinct topological patterns in the network. Accordingly, the following two features are extracted for nodes based on their respective positions in the network hierarchy in order to reveal structural distinctions between networks.

(i) Hierarchy-level: The NSD algorithm uses the hierarchy-level ($\rho_i$) of vertex $v_i$ as a discriminating feature, since earlier studies have demonstrated that nodes at the same level of hierarchy have similar capability to disseminate information [30,31] and influence others [29].

(ii) Hierarchy-affinity: Assortativity of networks is the tendency of vertices to connect preferentially to other vertices that are like or unlike them [32]. The tendency of individual nodes for assortative mixing has been found to be informative for understanding local structures in the network and network design [33–35]. Assortativity affects structural and dynamic properties of real-world networks, such as error tolerance, epidemic spreading [36,37] and robustness [38]. Owing to the sensitivity of assortativity to the underlying fine-grained topology of the network, it is an appropriate feature for network discrimination. While retaining the basic tenet of assortativity, the NSD algorithm adapts it by assessing the tendency of individual nodes to link with nodes at the same hierarchy level as themselves.
This tendency (hierarchy-affinity, $\eta$) is quantified for each node as the fraction of neighbours at the same level as the node itself. Let $N_i = \{v_j \in V \mid e_{ij} \in E\}$ denote the set of neighbours of vertex $v_i \in V$ and let $L_i = \{v_j : v_j \in N_i \wedge \rho_j = \rho_i\}$ denote the set of neighbours of $v_i$ at the same hierarchy level as itself. Then, the hierarchy-affinity of $v_i$ is computed as:

$$\eta_i = \frac{|L_i|}{|N_i|}$$
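The ratio above can be sketched in a few lines of Python 3. This is an illustrative sketch, not the authors' implementation: the adjacency-dictionary representation, the function name `hierarchy_affinity` and the toy graph are assumptions made for the example, and assigning an affinity of 0 to a vertex with no neighbours is also an assumption (the definition presumes a connected graph).

```python
def hierarchy_affinity(adj, level):
    """Return eta_i = |L_i| / |N_i| for every vertex.

    adj   : dict mapping each vertex to the set of its neighbours (N_i)
    level : dict mapping each vertex to its hierarchy level (rho_i)
    """
    eta = {}
    for v, nbrs in adj.items():
        if not nbrs:
            # Assumption: a vertex without neighbours gets affinity 0.
            eta[v] = 0.0
            continue
        # |L_i|: neighbours that sit at the same hierarchy level as v
        same_level = sum(1 for u in nbrs if level[u] == level[v])
        eta[v] = same_level / len(nbrs)
    return eta


# Hypothetical toy graph: a triangle a-b-c with a pendant vertex d on a.
adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
level = {"a": 2, "b": 2, "c": 2, "d": 1}
eta = hierarchy_affinity(adj, level)
```

Here vertex a has three neighbours, two of which share its level, so its affinity is 2/3, while the pendant vertex d has no neighbour at its own level.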

2.2. Composing and comparing network signatures

Having obtained two features for each node in the network, the NSD algorithm preserves the hierarchy-level and hierarchy-affinity of vertices in feature vectors $\vec{\rho}$ and $\vec{\eta}$ respectively. The nature of the distribution of feature values in the vectors is captured from their respective descriptive statistics. The empirical cumulative distribution function (CDF) is a useful summary of a univariate distribution. Quantiles derived from the CDF locate the boundaries of specified areas in the distribution of random variables and are therefore appropriate for assessing distributional similarities and differences. The NSD algorithm aggregates feature vectors $\vec{\rho}$ and $\vec{\eta}$ using quantiles and composes the network signature by concatenating the two quantile vectors.

Network similarity between a pair of networks is quantified using the Canberra distance between their respective hierarchy-based signatures. This distance measure has been reported to be effective for discriminating networks [13]. The larger the distance, the more dissimilar the compared networks.
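The signature construction and comparison described above can be sketched as follows in Python 3. Several choices are assumptions made for illustration only: the text does not state which quantiles are used, so the sketch takes the minimum, the quartiles and the maximum; quantiles are computed with a simple nearest-rank rule; and the function names and feature values are hypothetical.

```python
import math


def quantiles(values, probs):
    """Empirical quantiles of a sample using the nearest-rank rule."""
    xs = sorted(values)
    n = len(xs)
    return [xs[max(0, math.ceil(p * n) - 1)] for p in probs]


def signature(rho, eta, probs=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Concatenate the quantiles of hierarchy-level and hierarchy-affinity.

    The default probabilities are an assumption, not taken from the paper.
    """
    return quantiles(rho, probs) + quantiles(eta, probs)


def canberra(x, y):
    """Canberra distance: sum of |a - b| / (|a| + |b|) over coordinates.

    Coordinates where both entries are zero contribute nothing.
    """
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom > 0:
            total += abs(a - b) / denom
    return total


# Two toy networks described only by made-up per-node feature values.
sig_g = signature(rho=[1, 2, 2, 3], eta=[0.0, 0.5, 0.5, 1.0])
sig_h = signature(rho=[1, 1, 2, 2], eta=[0.0, 0.5, 1.0, 1.0])
distance = canberra(sig_g, sig_h)
```

Because each Canberra term is normalised by the magnitudes of the two coordinates, small-valued affinity quantiles and larger-valued level quantiles contribute on a comparable scale, which is one plausible reason for preferring it over an unnormalised distance here.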


Fig. 1. Core Decomposition using toy network (a) Original Network (1-core), (b) 2-core subgraph, (c) 3-core subgraph (d) 4-core subgraph.

3. Discovering network hierarchy

The foundation of the NSD algorithm is hierarchical decomposition of networks. We use two well-known efficient methods to reveal the underlying network hierarchy. The first is dyad³-based k-core decomposition [39] and the second is triad-based k-truss decomposition [40]. Application of either of the graph decomposition methods reveals node positions in the overall hierarchical topology of the network. We briefly explain the two graph decomposition methods below.

3.1. Core decomposition

A dyad in a social network is the smallest structure in which an actor can be embedded, and is the basic building block of social interaction. Hierarchy arising from node embedding in dyads reveals the underlying social structure of the network and hence is an appropriate signal for comparing graphs. The hierarchy is revealed by the k-core decomposition method that organizes a graph into subgraphs in which every vertex participates in at least k dyads (i.e. has degree at least k). Definitions adapted from [39] follow.

Definition 3.1. The k-core of $G$ is a subgraph $C_k = (V_k, E_k|V_k)$ iff $\forall v_i \in V_k : \delta_i \geq k$ and $C_k$ is the maximal subgraph with this property. □

Here $\delta_i = |N_i|$ denotes the degree of vertex $v_i$.

Definition 3.2. The coreness of vertex $v_i$, denoted by $\kappa_i$, is $k$ iff $v_i \in C_k \wedge v_i \notin C_{k+1}$.
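Definitions 3.1 and 3.2 can be made concrete with a short peeling sketch in Python 3: repeatedly remove a vertex of minimum remaining degree, and record the largest minimum degree seen so far as its coreness. This is a deliberately simple version for illustration (the minimum is found by a linear scan, so it is quadratic in the number of vertices) and is not the linear-time algorithm the paper relies on; the function name and toy graph are assumptions.

```python
def coreness(adj):
    """Coreness of every vertex, obtained by peeling minimum-degree vertices.

    adj : dict mapping each vertex to the set of its neighbours
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Pick a vertex with the smallest degree in the remaining subgraph.
        v = min(remaining, key=degree.__getitem__)
        # The coreness never decreases along the peeling order.
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core


# Hypothetical toy graph: triangle 1-2-3 with a pendant vertex 4 on vertex 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
core = coreness(adj)
```

In the toy graph the pendant vertex is peeled first with degree 1, so it has coreness 1, while the three triangle vertices all end up with coreness 2.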



Example 3.1. Fig. 1 illustrates the progressive core decomposition process using a toy network with 16 nodes and 34 edges. Sub-figures (a–d) show the hierarchy of subgraphs that represent increasingly denser regions of the graph. Nodes and edges not contained in the k-core are dimmed for increasing values of k. Each vertex in the 2-core participates in at least 2 dyads (Fig. 1(b)), each vertex in the 3-core participates in at least 3 dyads (Fig. 1(c)), and each vertex in the 4-core participates in at least 4 dyads (Fig. 1(d)). The coreness of vertex F is 3 since it belongs to the 3-core but not to the 4-core. □

³ A dyad/triad is a subgraph of two/three nodes and the links between them.

3.2. Truss decomposition

In a social network, the participation of a tie in a closed triad (triangle) implies a strong bond between two actors because it is reinforced by a common neighbour. Triadic structures 'embed' dyadic relations and capture the higher-order social structure of the network. Several important social theories such as structural balance, clusterability, ranked clusters, and transitivity have been expressed in triadic terms [41]. Intuitively, the strength of the tie between two social actors is proportional to the number of their common neighbours. It is noteworthy that the number of common neighbours of the endpoints of an edge is equal to the number of triangles in which that edge participates. Thus, the participation of an edge in closed triads is an effective measure of the strength of the tie [42]. Wang et al. define tie strength as support $\sigma_{ij}$ and quantify it as the count of triangles in which the edge $e_{ij}$ participates [43]. Formally,


Fig. 2. Truss Decomposition using toy network (a) Original Network (2-truss), (b) 3-truss subgraph, (c) 4-truss subgraph (d) 5-truss subgraph.

Definition 3.3. The support of edge $e_{ij}$, denoted by $\sigma_{ij}$, is $|N_i \cap N_j|$.



Quantification of the strength of a tie as edge support leads to a natural stratification of the network, where the strata can be identified by the inherent cohesion among the vertices. The k-truss decomposition algorithm reveals the nested hierarchy of subgraphs with the most cohesive subgraph at the top of the hierarchy. Definitions adapted from [43] follow.

Definition 3.4. The k-truss of $G$ is a subgraph $T_k = (V_k, E_k|V_k)$ iff $\forall e_{ij} \in E_k : \sigma_{ij} \geq (k - 2)$, and $T_k$ is the maximal subgraph with this property. □

Definition 3.5. The trussness of edge $e_{ij}$, denoted by $\top_{ij}$, is $k$ iff $e_{ij} \in T_k \wedge e_{ij} \notin T_{k+1}$.



We define the trussness of nodes in the graph. A node that belongs to the k-truss but not to the (k+1)-truss has trussness k. Formally,

Definition 3.6. The trussness of node $v_i$, denoted by $\tau_i$, is $k$ iff $v_i \in T_k \wedge v_i \notin T_{k+1}$.
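Definitions 3.3 to 3.6 can be illustrated with a naive peeling sketch in Python 3. At level k, every edge whose support (within the surviving subgraph) is below k − 1 cannot belong to the (k+1)-truss, so it is removed and assigned trussness k. The sketch recomputes supports on every pass and is therefore far slower than the in-memory algorithm cited in the paper. It also reads Definition 3.6 as "a node belongs to the k-truss if at least one of its incident edges does", so node trussness is the maximum trussness over incident edges; that reading, the function name and the toy graph are assumptions.

```python
def truss_decomposition(adj):
    """Return (edge_truss, node_truss) for a simple undirected graph.

    adj : dict mapping each vertex to the set of its neighbours
    """
    # Work on a copy so the caller's adjacency sets are left untouched.
    nbrs = {v: set(ns) for v, ns in adj.items()}
    edges = {frozenset((u, v)) for u in nbrs for v in nbrs[u]}

    def support(e):
        # Number of common neighbours of the endpoints (Definition 3.3).
        u, v = e
        return len(nbrs[u] & nbrs[v])

    edge_truss = {}
    k = 2  # every edge belongs to the 2-truss
    while edges:
        while True:
            # Edges that fail the (k+1)-truss requirement: support >= k - 1.
            weak = [e for e in edges if support(e) < k - 1]
            if not weak:
                break
            for e in weak:
                u, v = e
                edge_truss[e] = k
                edges.discard(e)
                nbrs[u].discard(v)
                nbrs[v].discard(u)
        k += 1

    # Assumed reading of Definition 3.6: maximum trussness of incident edges.
    node_truss = {}
    for e, t in edge_truss.items():
        for v in e:
            node_truss[v] = max(node_truss.get(v, 0), t)
    return edge_truss, node_truss


# Hypothetical toy graph: a 4-clique on 1..4 plus a pendant vertex 5 on 1.
adj = {1: {2, 3, 4, 5}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {1, 2, 3}, 5: {1}}
edge_truss, node_truss = truss_decomposition(adj)
```

In the toy graph the pendant edge lies in no triangle and gets trussness 2, while every clique edge lies in two triangles and gets trussness 4.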



Example 3.2. Fig. 2 illustrates the progressive truss decomposition process of the toy network. Sub-figures (a–d) show the hierarchy of increasingly denser subgraphs. Edges not contained in the k-truss are dimmed for increasing values of k. The edge between vertices L and M has trussness 2 because it is not contained in any triangle. Each edge in the 3-truss is part of at least 1 triangle (Fig. 2(b)), each edge in the 4-truss is part of at least 2 triangles (Fig. 2(c)), and each edge in the 5-truss is part of at least 3 triangles (Fig. 2(d)). The edge between vertices P and O has trussness 4 because it is part of the 4-truss but not the 5-truss. The node trussness of vertex P is 4 because it belongs to the 4-truss but not to the 5-truss. □

4. Variations of NSD algorithm

Two variations of the NSD algorithm result from using the above-mentioned methods for hierarchical graph decomposition. The first variation, NSD-C, extracts features from the k-core decomposition of the graph, while the second one, NSD-T, extracts features from the k-truss decomposition of the graph. In algorithm NSD-C, the feature hierarchy-level ($\rho_i$) of vertex $v_i$ is set to its coreness. The other feature, hierarchy-affinity ($\eta_i$) of $v_i$, is computed as the fraction of its neighbours with the same coreness as itself. Similarly, in NSD-T, the two hierarchy-based features are computed using the trussness of nodes.

Example 4.1. Consider Fig. 1, which shows the core-based decomposition of a toy network. Vertex E has coreness 4, therefore $\rho_E = 4$. Also, $\eta_E = \frac{4}{6}$, since out of its 6 neighbours, vertex E has 4 neighbours with coreness 4. □

Example 4.2. Consider Fig. 2, which shows the truss-based decomposition of a toy network. Vertex D has trussness 5, therefore $\rho_D = 5$. Also, $\eta_D = \frac{4}{7}$, since vertex D has 4 neighbours with trussness 5 out of its 7 neighbours. □

For hierarchical decomposition, algorithm NSD-C uses the efficient $O(|E|)$ k-core decomposition algorithm proposed in [44] and NSD-T uses the in-memory $O(|E|^{1.5})$ k-truss decomposition algorithm proposed in [43], so that large network decomposition is feasible on a consumer-grade machine.


Table 1
Structural properties of networks (Real-1). n: number of nodes, m: number of edges, d̄: average degree, dmax: maximum degree, gcc: global clustering coefficient, K: highest core level, T: highest truss level.

Network   n          m           d̄          dmax     gcc      K     T

Autonomous Systems (AS) [45]: AS peering information inferred from Oregon route-views between March 31 2001 and May 26 2001
AS-1      10,670     22,003      4.1257     2312     0.0093   17    16
AS-2      10,729     22,000      4.1010     2315     0.0085   15    14
AS-3      10,790     22,470      4.1649     2337     0.0094   17    15
AS-4      10,859     22,748      4.1897     2355     0.0097   18    15
AS-5      10,886     22,494      4.1326     2367     0.0089   17    15

Co-Author Networks (CA) [45]: Scientific collaborations between authors in papers from the e-print arXiv in various categories
CA-1      18,772     198,110     21.1016    504      0.3180   56    57
CA-2      23,133     93,497      8.0748     279      0.2643   25    26
CA-3      5,242      14,496      5.5271     81       0.6298   43    44
CA-4      12,008     118,521     19.7382    491      0.6594   238   239
CA-5      9,877      25,998      5.2603     65       0.284    31    32

Facebook Graphs (FB) [46]: Facebook friendship networks at American colleges
FB-1      11,509     486,967     84.6236    1377     0.1442   69    37
FB-2      36,371     1,590,655   87.4683    6312     0.0998   81    62
FB-3      14,916     686,501     92.0489    1602     0.1429   77    48
FB-4      17,444     801,853     91.9345    4459     0.1441   86    70
FB-5      6,472      266,378     82.3170    1124     0.1679   64    44

Food Web Networks (FW) [47]: Trophic networks of which species eats which
FW-1      128        2,137       33.3906    110      0.3142   24    11
FW-2      128        2,106       32.9062    110      0.3119   23    11
FW-3      183        2,494       27.2568    106      0.3322   24    15
FW-4      97         1,491       30.7422    104      0.4180   23    13
FW-5      97         1,492       30.7628    103      0.4232   23    13

Metabolic Networks (ME) [48]: Networks of metabolic pathways of various organisms
ME-1      1,268      3,011       4.7492     191      0        5     2
ME-2      490        1,163       4.7469     86       0        5     2
ME-3      1,082      2,589       4.7855     160      0        5     2
ME-4      751        1,768       4.7083     114      0        5     2
ME-5      767        1,796       4.6831     111      0        5     2

Peer-to-Peer Networks (PP) [45]: Snapshots of the Gnutella peer-to-peer file sharing network in August, 2002
PP-1      10,876     39,994      7.3545     103      0.0054   7     4
PP-2      8,846      31,839      7.1985     88       0.0075   9     4
PP-3      8,717      31,525      7.2329     115      0.0081   9     4
PP-4      6,301      20,777      6.5948     97       0.0206   10    5
PP-5      8,114      26,013      6.4118     102      0.0171   10    5

Scientific Networks (SC) [49]: Scientific networks
SC-1      54,870     1,311,227   47.7939    275      0.5549   35    24
SC-2      87,804     2,565,054   58.4268    131      0.4557   47    36
SC-3      94,893     3,260,965   68.7293    299      0.4986   41    36
SC-4      10,800     410,400     76.0121    155      0.6351   73    72
SC-5      63,336     1,596,876   50.4255    89       0.5766   37    30

World Trade Networks (WT) [50]: Bilateral trade flow between countries from 2010 to 2014
WT-1      231        15,000      129.8701   225      0.7498   106   98
WT-2      232        15,106      130.2241   224      0.7505   107   100
WT-3      234        14,919      127.5128   226      0.7402   105   96
WT-4      220        15,092      137.2      219      0.7728   101   101
WT-5      220        14,948      135.8909   219      0.7656   109   100

Table 2
Structural properties of networks (Real-2). n: number of nodes, m: number of edges, d̄: average degree, dmax: maximum degree, gcc: global clustering coefficient, K: highest core level, T: highest truss level.

Network   n           m           d̄       dmax      gcc      K     T

Structural Networks (SN) [51]: structures (tower, silo, Cofferdam, Jijian Plaza, machine element, tall building)
SN-1      80,676      2,194,830   54.41    89        0.57     37    24
SN-2      87,804      2,652,858   60.42    131       0.46     49    36
SN-3      94,653      3,803,485   80.36    4,145     0.16     70    63
SN-4      94,893      3,260,965   68.73    299       0.49     41    36
SN-5      151,926     7,494,215   98.66    332       0.47     82    60

Road Networks (RO) [45]: road networks of San Joaquin County, San Francisco, Pennsylvania, Texas and California (RO-1 to RO-5, respectively)
RO-1      18,263      23,874      2.61     8         0.03     3     3
RO-2      174,956     223,001     2.55     8         0.03     3     4
RO-3      1,088,092   1,541,898   5.67     18        0.06     6     4
RO-4      1,379,917   1,921,660   5.57     24        0.06     6     4
RO-5      1,965,206   2,766,607   5.63     24        0.06     6     4

Web Networks (WB) [51]: web graphs from Baidu, from UbiCrawler, of Notre Dame, of Stanford and from the Italian CNR domain (WB-1 to WB-5, respectively)
WB-1      415,641     3,284,387   15.80    127,090   0.0006   255   95
WB-2      163,598     1,747,269   21.36    1,102     0.95     101   102
WB-3      325,729     1,497,134   9.19     10,721    0.08     307   155
WB-4      281,903     2,312,495   16.40    38,626    0.008    86    62
WB-5      325,557     3,216,151   19.76    18,278    0.008    163   84


Fig. 3. Percentiles of core-based hierarchy-level of networks belonging to 4 different classes from Table 1. There are 5 networks in each class.

5. Experimental evaluation

The goal of the experimental evaluation is to assess the performance of the NSD-C and NSD-T algorithms w.r.t. their effectiveness, scalability and sensitivity. We evaluated the performance of the proposed algorithms against four recent network similarity algorithms. We selected the network-centric algorithm Ntangle [19], the node-centric algorithms NetSimile [13] and NetEMD [12], and the hierarchy-based algorithm NCKD [52] for comparison.⁴ The empirical study is specifically designed to examine the following questions.

(i) Are the two features (hierarchy-level, hierarchy-affinity) individually adequate for network comparison? (Section 5.2)
(ii) Which distance measure is the best for comparing hierarchy-based network signatures? (Section 5.3)
(iii) How effective is the proposed network similarity measure? (Section 5.4)
(iv) Are the two proposed network similarity algorithms scalable? (Section 5.5)
(v) Are the proposed network similarity algorithms sensitive towards random perturbations in networks? (Section 5.6)

We implemented the NSD-C and NSD-T algorithms in Python (64-bit, v2.7.3) using the igraph library and executed them on an Intel Core i7-6700 CPU @ 3.40 GHz with 8 GB RAM, running Ubuntu 16.04. In the following sub-sections, we describe the network datasets⁵ used for the investigation and report the results of experimentation.

5.1. About datasets

We performed experiments with real-world and synthetic datasets. The first real-world dataset (Real-1), shown in Table 1, consists of 40 publicly available real-world networks from 8 different classes. These networks are of varying size and order, ranging from small networks (e.g., FW-4 with ≈ 100 nodes and ≈ 1500 edges) to large networks (e.g., SC-3 with ≈ 95,000 nodes and ≈ 3 million edges). The second real-world dataset (Real-2), shown in Table 2, consists of 15 large and sparse networks belonging to three classes. For the synthetic dataset we generated 30 networks based on the Erdős–Rényi, Forest-Fire, and Watts–Strogatz models using the igraph package of Python. These networks are used for the scalability experiment and are described in Table 3.

5.2. How good are the hierarchy-based features?

Before proceeding to evaluate the performance of the proposed algorithm, we evaluate the goodness of the two hierarchy-based features by assessing their propensity for capturing network similarity. We examine 20 networks from four selected classes of networks: five each from AS, FB, FW, and PP (see Table 1 for network descriptions). We extract the core-based and truss-based hierarchy for each network to compute feature vectors. The plots of percentiles of hierarchy-level and hierarchy-affinity are shown in Figs. 3 to 6. A striking similarity between percentiles of features of networks belonging to the same class is clearly evident for both core- and truss-based decomposition. The observation affirms the conjecture that hierarchy-based network features are effective network discriminators.
As individual plots clearly show a distinctive pattern, we question whether each feature is individually sufficient for network discrimination. In order to ensure that we do not induce additional computational burden by combining the two features, we compare the performance of signatures using individual features as well as combined features.

⁴ We acknowledge with gratitude Wegner et al. [12] and Gallos et al. [19] for making the source code available for download. NetEMD code was downloaded from http://opig.stats.ox.ac.uk/resources and Ntangle code from http://home.dimacs.rutgers.edu/~lgallos/research_comparison.html.
⁵ Python source code and datasets available at https://github.com/rakhisaxena/NSD.

Given a set $S = \{G_1, G_2, \ldots, G_x\}$ of $x$ networks from $y$ disjoint classes $C = \{C_1, C_2, \ldots, C_y\}$, a network similarity measure should position networks from the same class closer to each other compared to networks from other classes. Wegner et al. [12] propose the $\bar{P}$ metric to quantify this ability. Given a network $G$, the ability of a network similarity measure $D$ to group similar networks can be measured in terms of the empirical probability $P(G) = P(D(G, G_1) < D(G, G_2))$, where $G_1$ is a randomly selected


Table 3
Structural properties of synthetic networks constructed using generative models. n: number of nodes, m: number of edges, d̄: average degree, dmax: maximum degree, gcc: global clustering coefficient, K: highest core level, T: highest truss level. K in network nomenclature means 1000.

Network   n           m           d̄         dmax   gcc        K    T

Erdős–Rényi (ER): Generator G(n, m = 2n)
ER100K    98,259      200,000     4.07085    15     3.76E−05   3    3
ER200K    196,342     400,000     4.0745     16     1.31E−05   3    3
ER300K    294,461     600,000     4.0752     16     2.00E−05   3    3
ER400K    392,679     800,000     4.0745     17     5.62E−06   3    3
ER500K    490,898     1,000,000   4.0741     16     1.20E−05   3    3
ER600K    588,952     1,200,000   4.0750     17     2.50E−06   3    3
ER700K    687,261     1,400,000   4.0741     16     6.43E−06   3    3
ER800K    785,152     1,600,000   4.0756     17     3.75E−06   3    3
ER900K    883,466     1,800,000   4.0748     16     5.83E−06   3    3
ER1000K   981,738     2,000,000   4.0744     18     3.75E−06   3    3

Forest Fire (FF): Generator G(n, f = 0.3, b = 0.2, a = 4). f: forward burning probability, b: backward burning ratio, a: number of ambassador vertices
FF100K    100,000     590,440     11.8088    741    0.0520     8    6
FF200K    200,000     1,180,669   11.8066    996    0.0504     8    6
FF300K    300,000     1,773,005   11.8200    1270   0.0500     8    6
FF400K    400,000     2,363,017   11.8150    1280   0.0496     8    7
FF500K    500,000     2,957,731   11.8309    1718   0.0484     8    6
FF600K    600,000     3,546,832   11.8227    1938   0.0488     8    7
FF700K    700,000     4,135,999   11.8171    1768   0.04935    8    7
FF800K    800,000     4,728,873   11.8221    2012   0.04878    9    7
FF900K    900,000     5,319,114   11.8202    2594   0.0479     9    7
FF1000K   1,000,000   5,907,205   11.8144    2681   0.0478     8    7

Watts–Strogatz (WS): Generator G(d = 1, nei = 4, p = 0.3). d: lattice dimension, nei: neighbourhood of vertex connectivity, p: rewiring probability
WS100K    99,999      400,000     8.00008    19     0.0704     5    5
WS200K    200,000     800,000     8          18     0.0707     5    5
WS300K    299,998     1,200,000   8.00005    19     0.0705     5    5
WS400K    399,998     1,600,000   8.00004    19     0.0703     5    5
WS500K    499,997     2,000,000   8.00004    20     0.0703     5    5
WS600K    599,994     2,400,000   8.00008    19     0.0705     5    5
WS700K    699,994     2,800,000   8.00006    20     0.0705     5    5
WS800K    799,995     3,200,000   8.00005    19     0.0704     5    5
WS900K    899,994     3,600,000   8.00005    19     0.0704     5    5
WS1000K   999,995     4,000,000   8.00004    20     0.0705     5    5

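Table 4 P¯ values for comparison of hierarchical features on Real-1 networks (Table 1).

Signature | Hierarchy-level (ρ) | Hierarchy-affinity (η) | Both (ρ∥η)
--------- | ------------------- | ---------------------- | ----------
Core-based | 0.9648 | 0.9812 | 0.9901
Truss-based | 0.9826 | 0.9896 | 0.9967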

Fig. 4. Percentiles of core-based hierarchy-affinity of networks belonging to 4 different classes from Table 1. There are 5 networks in each class.

network from the same class as G, and G2 is a randomly selected network from a different class [12]. For a set of networks S, the measure P¯ is computed as below:

\[ \bar{P} = \frac{1}{|S|} \sum_{G \in S} P(G) \tag{1} \]
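As an illustration of Eq. (1), the following minimal Python sketch estimates P¯ from a precomputed distance matrix and known class labels. The function name `p_bar` and the exhaustive enumeration of all (G1, G2) pairs in place of random sampling are our assumptions for illustration; this is not the implementation of Wegner et al. [12]. Networks that have no same-class or no different-class partner are skipped, which is also an assumption.

```python
from itertools import product


def p_bar(dist, labels):
    """Estimate P-bar: the mean over networks G of P(D(G, G1) < D(G, G2)).

    dist   -- square matrix (list of lists) of pairwise distances
    labels -- class label of each network, in the same order as dist
    """
    n = len(labels)
    per_network = []
    for g in range(n):
        same = [h for h in range(n) if h != g and labels[h] == labels[g]]
        diff = [h for h in range(n) if labels[h] != labels[g]]
        if not same or not diff:
            continue  # assumption: P(G) is undefined here, so G is skipped
        # Enumerate every (G1, G2) pair instead of sampling at random.
        wins = sum(dist[g][a] < dist[g][b] for a, b in product(same, diff))
        per_network.append(wins / (len(same) * len(diff)))
    return sum(per_network) / len(per_network)


# Toy example: two classes that are perfectly separated by the distances.
toy_dist = [
    [0, 1, 5, 6],
    [1, 0, 4, 7],
    [5, 4, 0, 2],
    [6, 7, 2, 0],
]
toy_labels = ["a", "a", "b", "b"]
print(p_bar(toy_dist, toy_labels))
```

A value of 1 means every network is closer to all members of its own class than to any network of another class; a value near 0.5 corresponds to a measure that cannot tell the classes apart.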

We computed P¯ values for the Real-1 networks shown in Table 1 when the network signature is composed of (i) only hierarchy-level (ρ), (ii) only hierarchy-affinity (η) and (iii) combined features (ρ ∥ η). Table 4 shows P¯ values for both core-based and truss-based network signatures. It is observed that performance of hierarchical features when used individually is fairly good, with hierarchy-affinity performing slightly better than hierarchy-level. However, combining


Fig. 5. Percentiles of truss-based hierarchy-level of networks belonging to 4 different classes from Table 1. There are 5 networks in each class.

Fig. 6. Percentiles of truss-based hierarchy-affinity of networks belonging to 4 different classes from Table 1. There are 5 networks in each class.

Table 5 Performance of four distance measures for hierarchical clustering of Real-1 networks (Table 1).

(a) NSD-C algorithm

Distance measure | Purity | Precision | Recall | Accuracy | NMI
---------------- | ------ | --------- | ------ | -------- | ---
Canberra | 1.0 | 1.0 | 1.0 | 1.0 | 1.0
Cosine | 0.7 | 0.5161 | 0.8 | 0.9026 | 0.8242
Earth Mover’s | 0.475 | 0.265 | 0.3875 | 0.8269 | 0.5483
Euclidean | 0.525 | 0.2014 | 0.725 | 0.6769 | 0.6358

(b) NSD-T algorithm

Distance measure | Purity | Precision | Recall | Accuracy | NMI
---------------- | ------ | --------- | ------ | -------- | ---
Canberra | 0.875 | 0.7475 | 0.925 | 0.9603 | 0.9367
Cosine | 0.65 | 0.4604 | 0.8 | 0.8833 | 0.7689
Earth Mover’s | 0.675 | 0.5172 | 0.5625 | 0.9013 | 0.7227
Euclidean | 0.525 | 0.2182 | 0.75 | 0.6987 | 0.6268

the two hierarchical features marginally boosts the power of the network signature to discriminate between networks. Since using the hierarchy-level feature adds no extra burden, we use the combined features in the network signature.

5.3. Comparative performance of distance measures

A wide variety of distance measures exists in the literature [14]. We empirically evaluate four distance measures commonly used in network comparison algorithms: Canberra, Cosine, Earth Mover’s and Euclidean distance. We first extract hierarchy-based signatures for the networks in the Real-1 dataset (Table 1) and compute pairwise distances between the signatures using each selected distance measure, obtaining four distance matrices. Subsequently, we use each distance matrix to cluster the networks using the hierarchical agglomerative technique.6 Clustering is used as the mechanism to compare the distance measures because the effectiveness of a network similarity method is positively correlated with the quality of the clustering scheme it delivers. We use five metrics (purity, precision, recall, accuracy, and NMI [53]) to measure the quality of the clustering schemes corresponding to the four distance measures. Table 5 shows the comparative performance of the four distance measures on the quality metrics of the clustering of the networks, with the best metric shown in bold. Canberra distance outperforms the other measures for both algorithms, affirming its effectiveness for network discrimination documented in [13].

5.4. Comparative evaluation of NSD-C and NSD-T with recent methods

We perform a comparative evaluation of the NSD-C and NSD-T algorithms with four state-of-the-art algorithms: NetSimile, Ntangle, NetEMD and NCKD. We use the Real-1 and Real-2 network datasets (Tables 1 and 2 respectively) for this experiment.

6 The hclust and cutree functions of the stats package in R were used for agglomerative clustering and to cut the dendrogram by specifying a known number of classes.
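The distance-matrix and clustering procedure of Sections 5.3 and 5.4 can be sketched in Python as follows. This is a hedged illustration rather than the code used in the experiments, which relied on R's hclust and cutree: SciPy's `linkage` and `fcluster` are swapped in, complete linkage is assumed because it is hclust's default, and the Earth Mover's distance is approximated by `scipy.stats.wasserstein_distance`, which treats the entries of each signature as one-dimensional samples. The helper name `cluster_signatures` and the toy signatures are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform
from scipy.stats import wasserstein_distance

# Metric names accepted by pdist, plus a callable standing in for
# Earth Mover's distance (assumption: signature entries are 1-D samples).
METRICS = {
    "canberra": "canberra",
    "cosine": "cosine",
    "euclidean": "euclidean",
    "emd": wasserstein_distance,
}


def cluster_signatures(signatures, metric, n_classes, method="complete"):
    """Return (distance matrix, cluster labels) for a list of signatures."""
    data = np.asarray(signatures, dtype=float)
    condensed = pdist(data, metric=METRICS[metric])   # pairwise distances
    tree = linkage(condensed, method=method)          # agglomerative tree
    # Cut the dendrogram into at most n_classes clusters (cf. cutree in R).
    labels = fcluster(tree, t=n_classes, criterion="maxclust")
    return squareform(condensed), labels


# Hypothetical signatures: two networks per class, two classes.
toy_signatures = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.1],
    [10.0, 20.0, 30.0],
    [11.0, 21.0, 29.0],
]
dist_matrix, cluster_labels = cluster_signatures(toy_signatures, "canberra", 2)
print(cluster_labels)
```

Calling the helper once per entry of `METRICS` yields the four distance matrices and the four clustering schemes that are compared in Table 5.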


Fig. 7. Dendrogram for networks described in Table 1. Networks belonging to same genre have same colour.

We perform hierarchical agglomerative clustering of the networks using distance matrices delivered by competing algorithms and compute quality metrics of the clustering scheme. Assuming that networks belonging to the same class are structurally similar, we expect them to cluster together. Fig. 7 shows dendrograms for the clustering thus obtained.
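The quality metrics of a clustering scheme can be computed from the known class labels and the cluster assignments, as in the sketch below. We assume the pair-counting definitions of precision, recall and accuracy (the Rand index) and the arithmetic-mean normalization of NMI described in standard information retrieval texts such as [53]; the exact definitions used in the experiments may differ. All function names are illustrative.

```python
from collections import Counter
from itertools import combinations
from math import log


def purity(classes, clusters):
    """Fraction of networks that belong to the majority class of their cluster."""
    by_cluster = {}
    for c, k in zip(classes, clusters):
        by_cluster.setdefault(k, Counter())[c] += 1
    return sum(max(cnt.values()) for cnt in by_cluster.values()) / len(classes)


def pair_metrics(classes, clusters):
    """Pair-counting precision, recall and accuracy (Rand index)."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(classes)), 2):
        same_class = classes[i] == classes[j]
        same_cluster = clusters[i] == clusters[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy


def nmi(classes, clusters):
    """Mutual information normalized by the mean of the two entropies."""
    n = len(classes)
    pc, pk = Counter(classes), Counter(clusters)
    joint = Counter(zip(classes, clusters))
    mi = sum(v / n * log(n * v / (pc[c] * pk[k])) for (c, k), v in joint.items())

    def entropy(counts):
        return -sum(v / n * log(v / n) for v in counts.values())

    denom = (entropy(pc) + entropy(pk)) / 2
    return mi / denom if denom else 1.0


true_classes = ["a", "a", "b", "b"]
found = [1, 1, 1, 2]  # hypothetical clustering with one misplaced network
print(purity(true_classes, found), pair_metrics(true_classes, found))
```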


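Table 6 Quality metrics for hierarchical clustering of networks.

Networks | Algorithm | Purity | Precision | Recall | Accuracy | NMI
-------- | --------- | ------ | --------- | ------ | -------- | ---
Real-1 networks (40 networks/8 classes) | NSD-C | 1.0 | 1.0 | 1.0 | 1.0 | 1.0
Real-1 networks (40 networks/8 classes) | NSD-T | 0.88 | 0.75 | 0.93 | 0.96 | 0.94
Real-1 networks (40 networks/8 classes) | NetSimile | 0.93 | 0.84 | 0.87 | 0.97 | 0.92
Real-1 networks (40 networks/8 classes) | Ntangle | 0.8 | 0.64 | 0.82 | 0.93 | 0.87
Real-1 networks (40 networks/8 classes) | NCKD | 0.63 | 0.31 | 0.85 | 0.79 | 0.76
Real-1 networks (40 networks/8 classes) | NetEMD | 0.63 | 0.34 | 0.71 | 0.82 | 0.69
Real-2 networks (15 networks/3 classes) | NSD-C | 1.0 | 1.0 | 1.0 | 1.0 | 1.0
Real-2 networks (15 networks/3 classes) | NSD-T | 1.0 | 1.0 | 1.0 | 1.0 | 1.0
Real-2 networks (15 networks/3 classes) | NetSimile | 0.8 | 0.61 | 0.8 | 0.8 | 0.72
Real-2 networks (15 networks/3 classes) | Ntangle | 0.73 | 0.53 | 0.76 | 0.74 | 0.57
Real-2 networks (15 networks/3 classes) | NCKD | 0.67 | 0.51 | 0.87 | 0.73 | 0.67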

Table 7 P¯ values for comparison of network similarity algorithms.

Networks | NSD-C | NSD-T | NetSimile | Ntangle | NCKD | NetEMD
-------- | ----- | ----- | --------- | ------- | ---- | ------
Real-1 (Table 1) | 0.99 | 0.99 | 0.97 | 0.90 | 0.96 | 0.85
Real-2 (Table 2) | 0.933 | 0.967 | 0.92 | 0.79 | 0.85 | –



Table 8 Signature generation time (in seconds) for largest size network in each class of Real-1 and Real-2 datasets (descriptions in Tables 1 and 2). ‘–’ denotes that the algorithm was unable to complete even after running for 72 h.

Network | NSD-C | NSD-T | NetSimile | Ntangle | NCKD | NetEMD
------- | ----- | ----- | --------- | ------- | ---- | ------
ME-1 | 0.03 | 0.12 | 0.32 | 0.64 | 0.01 | 8.36
FW-3 | 0.09 | 0.05 | 0.20 | 0.42 | 0.01 | 0.60
WT-2 | 0.10 | 2.23 | 13.31 | 1.55 | 0.02 | 67.41
PP-1 | 0.18 | 1.44 | 3.69 | 3.78 | 0.03 | 4.11
AS-5 | 0.19 | 1.74 | 36.07 | 4.06 | 0.03 | 76.07
CA-1 | 1.68 | 12.09 | 149.26 | 6.82 | 0.15 | 222.95
FB-2 | 7.47 | 247.41 | 10,996.75 | 21.44 | 1.18 | 98,110.06
SC-3 | 8.56 | 384.44 | 561.64 | 18.25 | 0.72 | –
WB-1 | 11.77 | 174.26 | 1330.08 | 61.98 | 1.24 | –
RO-5 | 29.72 | 122.74 | 19,893.38 | 716.88 | 2.59 | –
SN-5 | 26.25 | 1942.71 | > 48 h | 72.01 | 4.08 | –

Visual inspection of Fig. 7 shows that the NSD-C algorithm clusters Real-1 networks perfectly. NSD-T separates CA networks into two groups and places ME and PP networks in the same cluster. Metabolic networks (ME), being bipartite, have no closed triads, and hence all nodes are at the same hierarchy level in truss-based decomposition. This renders them indistinguishable by NSD-T and explains the relatively weaker performance of NSD-T in this experiment. NetSimile mixes SC networks with FW and FB networks. Ntangle is unable to distinguish between FW, SC and CA networks. The performance of NCKD and NetEMD is weaker than that of the other algorithms.

Dendrograms obtained by clustering Real-2 networks are shown in Fig. 8. NetEMD was dropped from this comparison as it was unable to process graphs of order >100k even after running for more than 72 h. Both NSD-C and NSD-T cluster the networks perfectly in this experiment. NetSimile groups the RO networks correctly but mixes the WB and SN networks. Ntangle mixes the WB networks with both SN and RO networks. Overall, NSD-C performs best on both the Real-1 and Real-2 datasets.

We compute quality metrics for quantitative analysis of the clustering schemes delivered by the competing algorithms. Table 6 presents the metrics for clustering Real-1 and Real-2 networks. The results corroborate the visual observation. NSD-C scores highest on all five quality metrics. The performance of NSD-T is comparable to that of NetSimile and better than that of Ntangle, NetEMD and NCKD. We also compute P¯ for the competing algorithms on both datasets and show the values in Table 7. These results reaffirm the conclusion that NSD-C, as well as NSD-T, has a superior ability to discriminate between networks compared to NetSimile, Ntangle, NCKD and NetEMD.

5.5. Scalability

We assess the scalability of the NSD-C and NSD-T algorithms using both synthetic and real-world large networks. Real-world networks demonstrate the applicability of the algorithms in practical settings, while synthetic datasets allow controlled variation of data characteristics to examine scalability.
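A scaled-down version of such a controlled experiment might look like the sketch below, which generates ER and WS graphs with the generator settings listed in Table 3 and times k-core decomposition on them. NetworkX is used here as an assumption; the generator descriptions in Table 3 do not tell us which library produced the original graphs. The Forest Fire model is omitted because, to our knowledge, NetworkX does not ship a generator for it. Averaging over several repeats and the helper names are likewise illustrative choices.

```python
import time

import networkx as nx


def make_er(n, seed=0):
    """Erdos-Renyi graph G(n, m = 2n), as in the ER rows of Table 3."""
    return nx.gnm_random_graph(n, 2 * n, seed=seed)


def make_ws(n, seed=0):
    """Watts-Strogatz graph on a 1-D ring lattice.

    nei = 4 neighbours on each side corresponds to k = 8 in NetworkX;
    edges are rewired with probability p = 0.3.
    """
    return nx.watts_strogatz_graph(n, 8, 0.3, seed=seed)


def time_core_decomposition(graph, repeats=5):
    """Mean wall-clock time (seconds) of k-core decomposition over repeats."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        nx.core_number(graph)  # coreness of every node
        timings.append(time.perf_counter() - start)
    return sum(timings) / repeats


if __name__ == "__main__":
    for n in (1_000, 2_000, 4_000):  # far smaller than the 100K-1000K of Table 3
        for name, make in (("ER", make_er), ("WS", make_ws)):
            g = make(n)
            print(name, n, g.number_of_edges(), time_core_decomposition(g))
```

Only the decomposition step is timed here; a full signature also requires feature extraction and quantile computation, so absolute numbers are not comparable with those reported in the paper.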


Fig. 8. Dendrograms for networks described in Table 2. Networks belonging to same genre have same colour.

We selected the largest size network from each class of networks in the Real-1 and Real-2 datasets (Tables 1 and 2) and noted the time for signature generation by the five competing algorithms. All algorithms were executed five times for each network to average out the effect of system activities on timing observations. The execution timings are shown in


Fig. 9. Signature generation time of NSD-C, NSD-T, NetSimile and NCKD for synthetic networks (descriptions in Table 3).

Table 8. Algorithm NCKD is the fastest as it uses efficient O(|E|) core decomposition and employs coarse-grained hierarchy feature, namely distribution of coreness, for network discrimination. NSD-C also uses O(|E|) core decomposition as its basis but extracts more complex hierarchical feature compared to NCKD algorithm and hence is a bit slower. Ntangle algorithm runs slower than NSD-C but faster than NSD-T for some networks. The faster execution speed of Ntangle is attributed to its


implementation in C compared to Python implementations of all other algorithms. Though the NCKD and Ntangle algorithms are fast, previous experiments have demonstrated their relatively weaker performance for network comparison.

We also performed scalability analysis on 30 graphs generated using three generative models (description in Table 3), with the number of nodes varying from 100k to 1000k in steps of 100k and edges varying proportionally depending on the model. As shown in Fig. 9, all algorithms exhibit approximately sublinear growth in execution timings for each model. Algorithms NSD-C and NSD-T both run much faster than NetSimile. The NCKD algorithm runs the fastest because it extracts computationally efficient features from the structural hierarchy in networks. The Ntangle algorithm uses Monte Carlo sampling to reduce its computational complexity, with the number of samples as an input parameter. For a fair comparison, we used an identical number of samples for all networks, and therefore the plot shows similar runtimes for all three generative models. Since the k-core decomposition algorithm is O(|E|), time increases linearly with the number of edges in the case of the NSD-C algorithm. The relatively short execution times of NSD-C and NSD-T strengthen the claim of scalability of both algorithms.

5.6. Sensitivity of NSD-C and NSD-T

The sensitivity of a network comparison algorithm is an important quality when it is vital to detect minor aberrations caused by missing data and to expose structural differences due to noise. We evaluate the sensitivity of the NSD-C and NSD-T algorithms towards two types of random perturbations in networks: (i) deletion of edges, which connotes missing data, and (ii) rewiring of edges, which represents noise. For this experiment, we compared networks with themselves after applying these two types of perturbations. We created variations G′x1, G′x2, ..., G′xk of G by deleting xi% of edges from it.
In the second variation, we created G′′x1, G′′x2, ..., G′′xk by rewiring xi% of the edges in G such that the expected degree of nodes in the network is preserved. Intuitively, both algorithms should yield a pairwise distance score of 0 when comparing G with G′0 and G′′0, with the score increasing as perturbations increase. To subdue the effect of randomness of the perturbations, we repeated each experiment three times with three versions of the perturbed graph and report the average similarity score.

Two arbitrarily chosen real-life networks (AS-1, CA-1) and two synthetic networks (FF10K, WS20K) were perturbed by deleting from 0% to 20% of their edges (in steps of 2%), and the resultant networks were compared using the NSD-C and NSD-T algorithms. The distance of the perturbed graph from G obtained by NSD-C and NSD-T for each network is plotted in Fig. 10. While both algorithms succeed in detecting changes in the perturbed graph, NSD-C under-reacts to missing data. NSD-T is more sensitive to changes in structure because deletion of an edge could reduce the support of several edges, whereas a single deleted edge affects the degree of only two nodes. Since edge deletion causes more significant changes in truss-based hierarchy than in core-based hierarchy, NSD-T is more sensitive to missing data than NSD-C.

In the second experiment, graphs were created by rewiring from 0% to 20% of the edges in steps of 2%. Variation in the distance of the original graph from the perturbed graph is plotted in Fig. 11. For all networks, distance increases as expected; however, the increase is higher in the case of NSD-T. The rise in distance for NSD-C is quite sluggish compared to NSD-T. This is attributed to the fact that rewiring edges is likely to break down closed triads in the network whereas the expected degree of nodes is preserved.
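The two perturbation schemes can be sketched with NetworkX as below. This is a minimal illustration under stated assumptions, not the procedure used for Figs. 10 and 11: the helper names are hypothetical, and rewiring is realized with `nx.double_edge_swap`, which preserves the exact degree of every node (a stricter condition than preserving the expected degree).

```python
import random

import networkx as nx


def delete_edges(graph, percent, seed=None):
    """Return a copy of graph with `percent` % of its edges removed at random."""
    rng = random.Random(seed)
    perturbed = graph.copy()
    edges = list(perturbed.edges())
    n_remove = round(len(edges) * percent / 100)
    perturbed.remove_edges_from(rng.sample(edges, n_remove))
    return perturbed


def rewire_edges(graph, percent, seed=None):
    """Return a copy of graph with roughly `percent` % of its edges rewired.

    Each double-edge swap replaces two edges (u-v, x-y) by (u-x, v-y),
    so percent/2 % of the edge count is used as the number of swaps.
    """
    perturbed = graph.copy()
    n_swaps = round(perturbed.number_of_edges() * percent / 200)
    if n_swaps > 0:
        nx.double_edge_swap(
            perturbed, nswap=n_swaps, max_tries=100 * n_swaps + 100, seed=seed
        )
    return perturbed


base = nx.gnm_random_graph(200, 800, seed=1)
missing = delete_edges(base, 10, seed=1)   # simulates missing data
noisy = rewire_edges(base, 10, seed=1)     # simulates noise
print(missing.number_of_edges(), noisy.number_of_edges())
```

Comparing `base` with each perturbed copy at increasing percentages, and averaging over several seeds, reproduces the shape of the experiment described above.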
The rewiring experiment thus indicates that NSD-T is the more sensitive of the two algorithms to noise.

6. Related works

Popular approaches for comparing networks include (i) graph isomorphism, (ii) graph edit distance, (iii) network alignment and (iv) feature extraction.

Graph isomorphism is a theoretically sound approach that has traditionally been employed to establish exact matching between two graphs of the same order, i.e., to decide whether two graphs are morphologically identical. Isomorphic graphs have a one-to-one correspondence between the vertices [54]. No polynomial-time algorithm is known for the general problem, and hence it has limited applicability for practical applications on contemporary large graphs. Further, exact matching is not mandated in most practical problems of network comparison, except those handling molecular structures. Therefore this line of research has not been pursued for network comparison in recent times.

Graph edit distance is an alternative approach for comparing networks and has been applied to pattern analysis and recognition. This approach finds approximate matching and is essentially an error-tolerant method [55]. Commonly addressed in bioinformatics, the network alignment approach has been extensively researched by biologists. The focus of network alignment methods is to find a matching or alignment between the vertices of two graphs [56]. These three approaches lead to algorithms with high computational complexity and are therefore non-scalable [57]. This deters their application to large networks.

The feature extraction approach has recently found favour with the community interested in analysing contemporary graphs. The strategy involves extracting features from the compared graphs and computing the distance between them to quantify differences. State-of-the-art network similarity methods employ the following three-step approach to quantify similarity between a pair of networks.
i. Feature Extraction: Map the two graphs to corresponding feature vectors derived from graph topology.


Fig. 10. Deleting edges to evaluate sensitivity of NSD-C and NSD-T using 4 network classes.

ii. Feature Aggregation: Aggregate feature vectors to compose network signatures.
iii. Distance Computation: Compute the distance between signatures to quantify similarity between the two networks.

Feature extraction based network-similarity methods can be classified broadly into two categories, namely, network-centric and node-centric methods [1,6,18]. The former extract features from the entire network topology, whereas the latter extract features of nodes or their neighbourhoods (1-step or k-step induced subgraph around the node). Tables 9 and 10 summarize recent feature extraction based methods for computing network similarity.

6.1. Network-centric approaches

Network-centric approaches focus on network topology and employ global network features for network comparison. These algorithms summarize network structure using properties such as eigenvalues of the graph Laplacian [9], density, transitivity [17], connectivity distances of nodes [59] and edge curvatures of networks under heat kernel embedding [58]. The method of Onnela et al. [18] compares diverse networks by summarizing their structure as they disintegrate into communities. All these methods, besides being computationally expensive, are unable to capture fine-grained topological distinctions between networks. Gallos et al. [19] propose the Ntangle method that uses densities of all connected induced


Fig. 11. Rewiring edges to evaluate sensitivity of NSD-C and NSD-T using 4 network classes.

subgraphs as network signatures. To handle large graphs, the method resorts to Monte Carlo sampling to select random subgraphs. As a result, the stability of results is compromised; different runs on the same network may result in different signatures. The NCKD algorithm [52] is an efficient network-centric algorithm that obtains coreness distributions of nodes and edges from hierarchical graph decomposition and quantifies network distance as the Jensen Shannon distance between the distributions. Though the method is scalable, the simplicity of the extracted features leaves it unable to detect fine-grained differences between networks.

6.2. Node-centric approaches

Node-centric approaches assess network structure through subgraphs, motifs, egonets or graphlets centred on nodes. Graphlets, first introduced by Pržulj [3], are small connected non-isomorphic subgraphs. Many node-centric algorithms consider the distribution of graphlets in a network to capture its topology [3,4,20]. They differ in the way the count of graphlets is aggregated to create the network signature and the way in which those signatures are compared. This strategy successfully meets the needs of researchers from biological sciences, as graphs in this domain are relatively small in size. Wegner et al. propose the graphlet-based network comparison algorithm NetEMD [12] that scales to large networks by using


Table 9 Summary of recent network-centric feature extraction based Network Similarity (NS) methods.

NS method | Feature vector components | Feature aggregation method | Distance measure
--------- | ------------------------- | -------------------------- | ----------------
Hewayda et al. [58] 2008 | Edge curvatures under heat kernel embedding | N.A. | Modified Hausdorff distance
Banerjee et al. [9] 2012 | Spectrum of Graph Laplacian | N.A. | Jensen Shannon Divergence
Onnela et al. [18] 2012 | Community structure | Hamiltonian, Partition entropy, Number of communities | Area between MRFs
Gallos et al. [19] 2014 | All possible connected induced sub-graphs | Histogram | Kolmogorov–Smirnov Distance
Saxena et al. [52] 2016 | Coreness of nodes, Intra- and Inter-shell edges | Probability distribution | Jensen Shannon Divergence
Attar et al. [17] 2017 | Density, Transitivity, Modularity, Assortativity, Degree distribution | For degree distribution: Mean, standard deviation, minimum and maximum | Weighted Manhattan Distance
Martin et al. [4] 2017 | All possible realizations of 3-node graphlets | Confusion matrix | Recall, Precision, F1, Matthews Correlation Coefficient
Schieber et al. [59] 2017 | Probability distribution functions of connectivity distances | N.A. | Jensen Shannon Divergence

Table 10 Summary of recent node-centric feature extraction based Network Similarity (NS) methods.

NS method | Feature vector components | Feature aggregation method | Distance measure
--------- | ------------------------- | -------------------------- | ----------------
Macindoe et al. [21] 2010 | Leadership, Bonding, Diversity of induced subgraph centred on node | Normalized Histogram | Earth Mover’s Distance
Pržulj et al. [3] 2010 | Graphlets attached to node | Normalized graphlet distribution | Euclidean Distance
Berlingerio et al. [13] 2013 | Degree, Clustering coefficient, Average degree and clustering coefficient of neighbours, Number of edges and neighbours of node’s egonet, Number of outgoing edges from node’s egonet | Moments of distribution (Mean, median, standard deviation, skewness, kurtosis) | Canberra Distance
Ali et al. [20] 2014 | Normalized counts of 3 to 5-node graphlets in 2-step egonets | Sum of normalized counts | Netdis statistics
Wegner et al. [12] 2017 | Graphlets attached to node | Graphlet Degree Distribution | Earth Mover’s Distance

network subsampling. Except for [12], the compute-intensive nature of the graphlet-based approach deters its application to domains with large networks, such as social networks. Macindoe et al. [21] consider all induced subgraphs of a parametrized radius centred on each vertex and compute three socially relevant structural features, Leadership, Bonding, and Diversity (L,B,D), for each subgraph. The computationally efficient NetSimile algorithm [13] quantifies network distance from moments of distributions of selected local topological properties derived from the 1-step neighbourhood of a node. Even though both approaches are scalable, they ignore the global patterns that emerge in networks from local level interactions.

To the best of the authors’ knowledge, no existing network similarity method (i) takes into account both global and local node embeddings in the network, and at the same time (ii) is scalable to large networks. Methods based on network-centric properties have limited scalability, whereas methods using node-centric properties miss out on the advantages that would accrue from considering the entire network topology. The NSD algorithm exploits network-centric properties extracted from local node embeddings to deliver a computationally feasible network similarity approach for large networks. The proposed algorithm thus addresses the research gap between myopic node-centric network comparison approaches and computationally expensive network-centric approaches.

7. Conclusion

In this paper, we present a hierarchy-based approach to detect similar networks. The proposed algorithm NSD (Network Similarity via Graph Decomposition) decomposes given networks to reveal underlying structural hierarchy. The network topology is captured using two features: (i) node embedding in the hierarchy (hierarchy-level) and (ii) node interaction


with different levels of hierarchy (hierarchy-affinity). Network signature is obtained using quantiles of the two hierarchybased features. Canberra distance between network signatures quantifies network similarity. NSD-C and NSD-T are two variations of NSD that use k-core and k-truss graph decomposition respectively to expose network hierarchy. Experimental evaluation affirms that hierarchy is an important topological characteristic that can be exploited for effective network discrimination. Comparison of NSD-C and NSD-T with four state-of-the-art network comparison algorithms establishes the superiority of NSD-C and NSD-T in terms of effectiveness. Execution timings for large synthetic and real-life networks affirm scalability of the hierarchical approach to network similarity. The empirical investigation also reveals that NSD-C is more scalable and effective than NSD-T but NSD-T is more sensitive to minor aberrations in the network structure. We conclude that NSD-T is appropriate for domains such as biological systems where speed can be sacrificed in favour of detecting fine-grained topological differences. However, for domains such as social, transportation and infrastructure networks, where scalability is required, NSD-C is more suitable compared to NSD-T. References [1] F. Emmert-Streib, M. Dehmer, Y. Shi, Fifty years of graph matching, network alignment and network comparison, Inform. Sci. (ISSN: 0020-0255) 346 (2016) 180–197. [2] N. Shervashidze, S.V.N. Vishwanathan, T. Petri, K. Mehlhorn, K.M. Borgwardt, Efficient graphlet kernels for large graph comparison, in: Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, 2009, pp. 488–495. [3] N. Pržulj, Biological network comparison using graphlet degree distribution, Bioinformatics (ISSN: 1367-4803) 26 (6) (2010) 853–854. [4] A.J. Martin, S. Contreras-Riquelme, C. Dominguez, T. 
Perez-Acle, Loto: a graphlet based method for the comparison of local topology between gene regulatory networks, PeerJ 5 (2017) e3052. [5] P. Papadimitriou, A. Dasdan, H. Garcia-Molina, Web graph similarity for anomaly detection, J. Internet Serv. Appl. 1 (1) (2010) 19–30. [6] S. Soundarajan, T. Eliassi-Rad, B. Gallagher, A guide to selecting a network similarity method, in: Proceedings of SIAM International Conference on Data Mining, ISBN: 978-1-510811-51-5, 2014, pp. 1037–1045. [7] T. Dullien, R. Rolles, Graph-based comparison of executable objects, in: Proceedings of Symposium sur la Securite Des Technologies de L’Information Et Des Communications (SSTIC), 2005. [8] S.M. Hsieh, C.C. Hsu, Graph-based representation for similarity retrieval of symbolic images, Data Knowl. Eng. 65 (3) (2008) 401–418. [9] A. Banerjee, Structural distance and evolutionary relationship of networks, Biosystems 107 (3) (2012) 186–196. [10] C. Faloutsos, D. Koutra, J.T. Vogelstein, DELTACON: A principled massive-graph similarity function, in: Proceedings of the 13th SIAM International Conference on Data Mining, 2013, pp. 162–170. [11] J. Leskovec, J. Kleinberg, C, ACM Trans. Knowl. Discov. Data 1 (1) (2007). [12] A.E. Wegner, L. Ospina-Forero, R.E. Gaunt, C.M. Deane, G. Reinert, Identifying Networks with Common Organizational Principles, ArXiv e-prints. [13] M. Berlingerio, D. Koutra, T. Eliassi-Rad, C. Faloutsos, Network similarity via multiple social theories, in: Proceedings of IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 2013, pp. 1439–1440. [14] S.H. Cha, Comprehensive survey on distance/similarity measures between probability density functions, Int. J. Math. Models Methods Appl. Sci. 1 (4) (2007) 300–307. [15] M. Newman, Networks: An Introduction, Oxford University Press, New York, NY, USA, 2010, ISBN: 0199206651, 9780199206650. [16] A. Kelmans, Comparison of graphs by their number of spanning trees, Discrete Math. 
(ISSN: 0012-365X) 16 (3) (1976) 241–261. [17] N. Attar, S. Aliakbary, Classification of complex networks based on similarity of topological network features, Chaos 27 (9) (2017). [18] J.-P. Onnela, D.J. Fenn, S. Reid, M.A. Porter, P.J. Mucha, M.D. Fricker, N.S. Jones, Taxonomies of networks from community structure, Phys. Rev. E 86 (2012) 036104. [19] L.K. Gallos, N.H. Fefferman, Revealing effective classifiers through network comparison, Europhys. Lett. 108 (3) (2014) 38001. [20] W. Ali, T. Rito, G. Reinert, F. Sun, C.M. Deane, Alignment-free protein interaction network comparison, Bioinformatics 30 (17) (2014) i430–i437. [21] O. Macindoe, W. Richards, Graph comparison using fine structure analysis, in: Proceedings of the 2nd IEEE International Conference on Social Computing, 2010, pp. 193–200. [22] H. Mengistu, J. Huizinga, J. Mouret, J. Clune, The evolutionary origins of hierarchy, PLoS Comput. Biol. 12 (6) (2016) 1–23. [23] E. Ravasz, A.L. Barabási, Hierarchical organization in complex networks, Phys. Rev. E 67 (2003). [24] E. Mones, L. Vicsek, T. Vicsek, Hierarchy measure for complex networks, PLoS One 7 (3) (2012) 1–10. [25] D. Shizuka, D.B. McDonald, The network motif architecture of dominance hierarchies, J. R. Soc. Interface 12 (105) (2015). [26] K.D. Farnsworth, L. Albantakis, T. Caruso, Unifying concepts of biological function from molecules to ecosystems, Oikos (ISSN: 1600-0706) 126 (10) (2017) 1367–1376. [27] A.S. Maiya, T.Y. Berger-Wolf, Inferring the maximum likelihood hierarchy in social networks, in: Proceedings of the 2009 International Conference on Computational Science and Engineering - Volume 04. CSE’09, ISBN: 978-0-7695-3823-5, 2009, pp. 245–250. [28] A. Clauset, C. Moore, M. Newman, Hierarchical structure and the prediction of missing links in networks, Nature 453 (2008) 98–101. [29] R. Saxena, S. Kaur, V. Bhatnagar, Social centrality using network hierarchy and community structure, Data Min. Knowl. Discov. 32 (5) (2018) 1421–1443. [30] L. 
Gallos, S. Havlin, M. Kitsak, F. Liljeros, H. Makse, L. Muchnik, H. Stanley, Identification of influential spreaders in complex networks, Nat. Phys. 6 (11) (2010) 888–893. [31] M.G. Rossi, F.D. Malliaros, M. Vazirgiannis, Spread it good, spread it fast: Identification of influential nodes in social networks, in: Proceedings of the 24th International Conference on World Wide Web, ISBN: 978-1-4503-3473-0, 2015, pp. 101–102. [32] M.E.J. Newman, Mixing patterns in networks, Phys. Rev. E 67 (2003). [33] M. Piraveenan, M. Prokopenko, A.Y. Zomaya, Local assortativeness in scale-free networks, Europhys. Lett. Assoc. 84 (2008). [34] M. Piraveenan, M. Prokopenko, A. Zomaya, Assortative mixing in directed biological networks, IEEE/ACM Trans. Comput. Biol. Bioinform. (ISSN: 1545-5963) 9 (1) (2011) 66–78. [35] G. Thedchanamoorthy, M. Piraveenan, D. Kasthuriratna, U. Senanayake, Node assortativity in complex networks: An alternative approach, Procedia Comput. Sci. 29 (2014) 2449–2461. [36] J.G. Foster, D.V. Foster, P. Grassberger, M. Paczuski, Edge direction and the structure of networks, Proc. Natl. Acad. Sci. 107 (24) (2010) 10815–10820. [37] M.E.J. Newman, J. Park, Why social networks are different from other types of networks, Phys. Rev. E 68 (2003). [38] M.E.J. Newman, Assortative mixing in networks, Phys. Rev. Lett. 89 (2002). [39] S.B. Seidman, Network structure and minimum degree, Social Networks 5 (1983) 269–287.


[40] J. Cohen, Trusses: Cohesive Subgraphs for Social Network Analysis, NSA: Technical report.
[41] K. Faust, Comparing social networks: Size, Adv. Methodol. Stat. 3 (2) (2006) 185–216.
[42] X. Shi, L. Adamic, M. Strauss, Networks of strong ties, Physica A (ISSN: 0378-4371) 378 (1) (2007) 33–47.
[43] J. Wang, J. Cheng, Truss decomposition in massive networks, Proc. VLDB Endow. 5 (9) (2012) 812–823.
[44] V. Batagelj, M. Zaveršnik, Fast algorithms for determining (generalized) core groups in social networks, Adv. Data Anal. Classif. 5 (2) (2011) 129–145.
[45] J. Leskovec, A. Krevl, SNAP Datasets: Stanford Large Network Dataset Collection, http://snap.stanford.edu/data, 2014.
[46] A. Traud, P. Mucha, M. Porter, Social structure of Facebook networks, Physica A (ISSN: 0378-4371) 391 (16) (2012) 4165–4180.
[47] V. Batagelj, A. Mrvar, Pajek datasets, URL http://vlado.fmf.uni-lj.si/pub/networks/data/, 2006.
[48] H. Jeong, B. Tombor, R. Albert, Z.N. Oltvai, A.L. Barabási, The large-scale organization of metabolic networks, Nature (ISSN: 0028-0836) 407 (6804) (2000) 651–654.
[49] T.A. Davis, Y. Hu, The University of Florida sparse matrix collection, ACM Trans. Math. Softw. (TOMS) 38 (1) (2011).
[50] R. Feenstra, R. Lipsey, H. Deng, A.C. Ma, H. Mo, World Trade Flows: 1962-2000, NBER Working Papers 11040, National Bureau of Economic Research, Inc, 2005.
[51] R.A. Rossi, N.K. Ahmed, The network data repository with interactive graph analytics and visualization, in: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015, URL http://networkrepository.com.
[52] R. Saxena, S. Kaur, D. Dash, V. Bhatnagar, Leveraging structural hierarchy for scalable network comparison, in: 27th International Database and Expert Systems Applications Conference (DEXA), 2016, pp. 287–302.
[53] C.D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press, Cambridge, UK, ISBN: 978-0-521-86571-5, 2008.
[54] N. Deo, Graph Theory with Applications to Engineering and Computer Science, Prentice-Hall, Upper Saddle River, NJ, USA, 1974.
[55] X. Gao, B. Xiao, D. Tao, X. Li, A survey of graph edit distance, Pattern Anal. Appl. (ISSN: 1433-7541) 13 (1) (2010) 113–129.
[56] M. Bayati, D.F. Gleich, A. Saberi, Y. Wang, Message-passing algorithms for sparse network alignment, ACM Trans. Knowl. Discov. Data (TKDD) 7 (1) (2013).
[57] S. Lu, J. Kang, W. Gong, D. Towsley, Complex network comparison using random walks, in: Proceedings of the 23rd International Conference on World Wide Web, 2014, pp. 727–730.
[58] H. ElGhawalby, E.R. Hancock, Measuring graph similarity using spectral geometry, in: Image Analysis and Recognition, ISBN: 978-3-540-69812-8, 2008, pp. 517–526.
[59] T.A. Schieber, L. Carpi, A. Díaz-Guilera, P.M. Pardalos, C. Masoller, M.G. Ravetti, Quantification of network structural dissimilarities, Nat. Commun. 8 (2017).
