Reachability preserving compression for dynamic graph

Information Sciences 520 (2020) 232–249


Reachability preserving compression for dynamic graph

Yuzhi Liang a, Chen Chen a, Yukun Wang a,c, Kai Lei a,c,∗, Min Yang b, Ziyu Lyu b

a Shenzhen Key Lab for Information Centric Networking & Blockchain Technology (ICNLAB), School of Electronic and Computer Engineering (SECE), Peking University, Shenzhen 518055, PR China
b Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
c PCL Research Center of Networks and Communications, Peng Cheng Laboratory, Shenzhen, China

Article info

Article history: Received 27 August 2019; Revised 5 February 2020; Accepted 9 February 2020; Available online 10 February 2020

Keywords: Query preserving graph compression; Compressed graph maintenance; Dynamic graph; Graph query

Abstract

Reachability preserving compression generates small graphs that preserve only the information relevant to reachability queries, and the compressed graph can answer any reachability query without decompression. Existing reachability preserving compression algorithms either require a long compression time or include redundant data in the compressed graph. In this paper, we propose a novel edge sorting algorithm for fast, non-redundant reachability preserving compression. We also found that the current incremental reachability compression algorithms for dynamic graphs may return incorrect results in some cases. Therefore, we propose two novel incremental reachability compression algorithms: incremental reachability preserving compression with optimum compression ratio, which generates an updated compressed graph that is exactly the same as the graph computed by recompression, and fast incremental reachability preserving compression, which can update the compressed graph in a short time. Extensive experiments on real datasets show the efficiency and effectiveness of our methods. © 2020 Elsevier Inc. All rights reserved.

1. Introduction

With the development of the Internet, the scale of real-life graphs is increasing dramatically [1,2,7,25]. Take online social media as an example: the number of monthly active users of Facebook has reached 2 billion, and this number is still increasing. Querying a large graph is costly. When the number of nodes and edges in a graph is as high as several billion, even a simple reachability query on the graph can be slow. Although using indexes can reduce query time, building and storing indexes incur extra costs. To alleviate the aforementioned problems, many studies have proposed to compress graphs while maintaining the primary information of the original graph. Considering that users usually conduct a certain type of query on a graph, Fan et al. first proposed reachability preserving compression in [11]. Reachability preserving compression is a technique that compresses a directed graph1 with respect to a certain class of queries, and only the information related to the query class is retained in the compressed graph. The query evaluation methods on the original graph can be directly applied to the



∗ Corresponding author. E-mail addresses: [email protected] (Y. Liang), [email protected] (C. Chen), [email protected] (Y. Wang), [email protected] (K. Lei), [email protected] (M. Yang), [email protected] (Z. Lyu). 1 An undirected graph can be converted to a directed graph by replacing each edge with two directed edges going in opposite directions. https://doi.org/10.1016/j.ins.2020.02.028 0020-0255/© 2020 Elsevier Inc. All rights reserved.


compressed graph without decompression. Compared to traditional lossless compression, there is less information to be maintained in query preserving compression [9,19,21]. Existing query preserving compression includes reachability preserving compression [11,43], k-reach query compression [23], and graph pattern preserving compression [11], etc. In this paper, we focus on reachability preserving compression. The reachability query is one of the most popular queries in practice, and the compression ratio of reachability preserving compression is the best among the existing query preserving compression algorithms. In [11], the authors proposed the reachability preserving compression algorithm compressR. The algorithm compressR compresses a graph by merging the nodes in the same reachability equivalence relation into a hypernode, and then constructs the edges of the compressed graph according to the topology of the original graph. However, a problem of compressR is that it does not define the order in which edges are built during edge construction. From our study, we found that if the edges are built in a wrong order, the compressed graph may contain redundant edges. We propose a solution to this problem in this paper. On the other hand, most real-life graphs are dynamic in nature, but research on incremental query preserving compression is very limited. Moreover, we found that the existing incremental reachability preserving compression method incRCM [11] may return incorrect results in some cases. In this study, we propose two novel incremental reachability preserving compression algorithms. Our contributions are as follows.

1. The current reachability preserving compression algorithm compressR is suboptimal. We improve compressR by introducing edge ordering in edge construction. With the edge ordering, the compressed graph is unique and achieves the optimum compression ratio.
2. We found that the existing incremental reachability preserving compression method incRCM may return an incorrect compressed graph in some cases. We describe the problems in detail in this paper.
3. We propose two novel incremental reachability preserving compression methods, namely incRPCO and incRPCF. Both incRPCO and incRPCF can cover special updates. The updated compressed graphs generated by incRPCO are optimum in terms of compression ratio. The algorithm incRPCF is very fast because we only modify the nodes and edges that affect the reachability of the nodes.
4. We verify the effectiveness and efficiency of the proposed algorithms on real datasets. The results show that our methods significantly outperform existing methods.

2. Related work

The related work can be categorized into query preserving compression and incremental query preserving compression for dynamic graphs.

2.1. Query preserving compression

Query preserving compression compresses a graph relative to a certain class of queries. The compressed graph can answer any query in the query class without decompression, and any query algorithm for the original graph can be directly applied to the compressed graph. Different from traditional graph compression algorithms, which compress a graph by converting it into a compact data structure [3,20,21], query preserving compression algorithms reduce the size of the graph by defining a query equivalence class and then merging the nodes in the same query equivalence class into a hypernode. Query-friendly compression is close to query preserving compression. The general approach of query-friendly compression algorithms is to design a special data structure based on a specific query class, and then use the designed data structure to represent the target graphs [22,28,33,41].
For example, literature [22] described a data structure based on the relationship between the Eulerian path and multiposition linearization; a graph compressed with this data structure can answer both out-neighbor and in-neighbor queries in sublinear time. Literature [41] used a triangulation-based structure to represent the target data graph; the compressed graph is friendly to common neighbour queries. There are two main differences between query-friendly compression methods and query preserving compression methods: (1) most query-friendly compression methods use compact data structures to store graphs, which may require (partial) decompression when querying; (2) query evaluation methods for the original graphs cannot be directly applied to the compact data structure. There is also some research on graph-based learning methods, which aims to discover new information from real-world graphs. For instance, literature [24] studies learning across multiple social networks. Current research on query preserving compression is limited. Generally, query preserving compression reduces graph size by merging nodes that have identical query relations. Fan et al. first proposed query preserving compression [11] and designed two query preserving compression algorithms, one for reachability queries and the other for graph pattern queries. The reachability preserving compression algorithm compressR compresses a graph by merging nodes that have the same set of ancestors and descendants into a hypernode. The graph pattern preserving compression algorithm in [11] compresses a graph by merging bisimilar nodes [32]. DAG-reduction can also be used for reachability preserving compression. It reduces the size of the graph by performing equivalence reduction based on the result of transitive reduction [43].
Other query preserving compressions include k-reach preserving compression [23], maximum Steiner connected k-core preserving compression [18], and the lossy distance preserving compression Shrink [31], etc. The main difference between k-reach preserving compression and maximum Steiner connected k-core preserving compression lies in how the query equivalence class is defined. In

Table 1. Key notations.

u, v: nodes in a graph
e = (u, v): an edge from node u to node v
p(u, v): a path from node u to node v
V: a finite set of nodes
E: a finite set of edges
G: the original graph
Gr: the compressed graph
ΔG: the updated part
QR: a reachability query function on the original graph
Q′R: a reachability query function on the compressed graph
Re: a reachability equivalence relation
[v]Re: the equivalence class/hypernode in relation Re containing node v
[v]SCC: the strongly connected component containing node v
A(v): the set of ancestors of v
D(v): the set of descendants of v
P(v): the set of parents of v
C(v): the set of children of v

k-reach preserving compression, two nodes are equivalent if their neighbor sets are identical, while in maximum Steiner connected k-core preserving compression, two nodes are equivalent if they can reach each other and their core values are the same. Shrink can compress a graph to any size and retain its distance information as much as possible, but when the compression ratio becomes higher, the query accuracy on the compressed graph will be affected.

2.2. Compressed graph maintenance

Real-life graphs are often updated frequently [37]. Researchers have developed some methods related to compressed graph maintenance. Firstly, some works designed special data structures to support dynamic graphs. For example, Brodal and Fagerberg designed a representation based on the adjacency list that supports edge updates [6]. Iverson and Karypis describe several ways to maintain dynamic graphs with five data structures, and analyze the advantages and disadvantages of the five data structures in depth [13]. However, graphs generated by the above two methods do not support reachability queries. Although we can preserve the node reachability information of graphs by constructing transitive closures of the graphs [8,12,30], building and maintaining transitive closures incur extra costs. Alternatively, we can build an index of the compressed graph. For example, [27] represented a graph by using an adjacency trunk array to store all incoming/outgoing nodes of a node, and an index array to store the starting position of each trunk in the incoming/outgoing arrays. When an edge is added to the graph, the representation can be updated by preallocating a bucket or moving all incoming/outgoing edges of the two vertices of the added edge to the end of the incoming/outgoing edge lists.
When an edge is deleted from the graph, the representation can be updated by updating the number of incoming/outgoing edges in the index arrays and removing the associated vertex from the incoming/outgoing edge lists. DAGGER is a scalable index for reachability queries [39]. DAGGER uses relaxed interval labeling to enable dynamic graph indexing. Bramandia proposed an indexing method that supports incremental maintenance of a 2-hop label [5]. Literature [26] proposed a fully-dynamic index data structure designed for influence analysis on evolving networks. They used a sketch-based method named Reverse Influence Sampling (RIS) [4] and constructed a corresponding update algorithm. There are also some methods designed for preserving the primary information of dynamic graphs. A method for maintaining a minimum spanning forest of a dynamic planar graph is proposed in [10]; the updates can be changes in edge weights and the insertion and deletion of edges and vertices. Khan and Aggarwal proposed gmatrix to summarize the graph stream in real time while preserving the primary topology of the graph [14,15]. gmatrix can estimate some structural properties, such as the reachability of two nodes via frequently visited edges. Literature [40] designed a cluster-preserving node replacement strategy to preserve the inherent clustering structure in streaming graphs. Literature [11] proposed an algorithm incRCM for maintaining a query preserving compressed graph. However, we found that incRCM may have problems with special updates. We will further discuss these problems in the following sections.

3. Preliminary

In this section, we introduce some terminologies and state our problem definition.

3.1. Notation and terminology

Firstly, we introduce the key notations used in this paper in Table 1. Secondly, we describe some terminologies used in this paper.


Fig. 1. Workflow of query preserving compression.

Definition 1. Graphs: A directed graph G = (V, E) consists of a set of nodes V and a set of edges E ⊆ V × V, where (u, v) ∈ E denotes a directed edge from node u to node v.

Definition 2. Reachability graph queries: A reachability query [29,36,42] between u and v on a directed graph, denoted by QR(u, v), asks whether there exists a path from node u to node v in the graph.

3.2. Problem statement

Problem 1. Reachability preserving graph compression. A reachability preserving compression can be defined by a triple <R, F, P>. In the triple, R = <RV, RE> is a reachability preserving compression function consisting of the compression strategy for nodes RV and that for edges RE, F : QR → Q′R is a reachability query rewriting function, and P is a post-processing function. The compressed graph Gr generated by R should satisfy: (1) the reachability query result on the original directed graph G is consistent with the post-processed rewritten reachability query on the compressed graph Gr, i.e., QR(G) = P(Q′R(Gr)) with Q′R = F(QR); (2) any reachability query algorithm can be directly used to compute Q′R(Gr) without decompressing Gr. The relations of QR, Q′R, G, and Gr are demonstrated in Fig. 1.

Problem 2. Incremental reachability preserving compression. Incremental reachability preserving compression is to maintain the compressed graph Gr = R(G) when the original graph G is updated. Formally, for a compressed graph Gr = R(G) and a set of updates ΔG (including edge insertions and edge deletions), incremental compression computes an updated compressed graph G′r = Gr ⊕ ΔGr which satisfies: (1) the updated G′r is identical to R(G ⊕ ΔG); (2) updating Gr is faster than re-compressing the updated G from scratch.

4.
Edge sorting for reachability preserving compression

The existing reachability preserving compression compressR [11] compresses a directed graph by merging the nodes which have the same set of ancestors and the same set of descendants, and then constructing the edges of the compressed graph according to the topology of the original graph. The time complexity of compressR is O(|V|(|V| + |E|)). However, the algorithm compressR does not define the construction order of the edges. If the edges are constructed in a random order, the compressed graph may contain redundant edges. For example, in Fig. 2, there are three edges (d, e), (c, e), (c, d) in the subgraph {c, d, e} of G. During edge construction, if the edges in the original graph are visited in the order (c, e), (d, e), (c, d) or (d, e), (c, e), (c, d), all three edges will be added to the compressed graph. Nonetheless, if the edges in the original graph are visited in the order (c, d), (d, e), (c, e), we will find that the edge (c, e) is redundant, because node c can already reach node e via the path (c, d), (d, e) in the compressed graph. To solve this problem, we define an edge sorting scheme for reachability preserving compression. We first introduce some notations.

Node topological rank: The topological rank of nodes is defined in [11]. Let r(v) denote the topological rank of node v; r(v) can be calculated as follows: (1) if v has no children, r(v) = 0; (2) if node v and node u are in the same strongly connected component [35], r(v) = r(u); (3) otherwise, let s′ range over the children of s; then r(s) = max(r(s′)) + 1. Based on node topological rank, we define the edge order as follows.
Edge order: Given two edges e0 = (u0, v0) and e1 = (u1, v1), the order of e0 and e1, denoted as o(e0) and o(e1), is defined as follows: (1) if r(u0) < r(u1), then o(e0) < o(e1); (2) if r(u0) = r(u1) and r(v0) > r(v1), then o(e0) < o(e1); (3) if r(u0) = r(u1) and r(v0) = r(v1), then o(e0) = o(e1); (4) otherwise, o(e0) > o(e1). The designed edge order has the following property.
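The rank computation and this ordering can be realized as a plain sort key; the following is a minimal sketch of ours (function names are our own), assuming the strongly connected components have already been collapsed so the graph is a DAG:

```python
def topo_ranks(adj, nodes):
    """Topological rank on a DAG: r(v) = 0 for a node without
    children, else 1 + the maximum rank over its children."""
    rank = {}
    def r(v):
        if v not in rank:
            children = adj.get(v, ())
            rank[v] = 0 if not children else 1 + max(r(c) for c in children)
        return rank[v]
    for v in nodes:
        r(v)
    return rank

def edge_sort_key(rank):
    """Sort key realizing the edge order: ascending in r(u);
    for equal r(u), descending in r(v)."""
    return lambda e: (rank[e[0]], -rank[e[1]])
```

On the Fig. 2 subgraph {c, d, e} with edges (c, d), (d, e), (c, e), the ranks are r(e) = 0, r(d) = 1, r(c) = 2, and `sorted(edges, key=edge_sort_key(rank))` yields (d, e), (c, d), (c, e), so the redundant edge (c, e) is visited last.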


Fig. 2. Example of reachability preserving graph compression.

Lemma 1. If an edge (u, v) in a directed acyclic graph is redundant for maintaining reachability, the order of edge (u, v) is larger than that of all the non-redundant edges on a path from node u to node v.

Proof. Suppose the edge (u, v) is redundant for maintaining node reachability. Then there must exist a path from node u to node v which consists of multiple non-redundant edges, and the topological ranks decrease strictly along this path. Let (a, b) be any edge on the path. If a ≠ u, then r(a) < r(u), so o(a, b) < o(u, v); if a = u, then b is an inner node x with r(v) < r(x), so o(u, x) < o(u, v). The lemma is proved.

Based on the property of the edge order and compressR, we propose a query preserving compression algorithm optCompR in Algorithm 1. The process can be divided into three steps:

Algorithm 1. Algorithm optCompR for reachability.
Input: A graph G = (V, E);
Output: A compressed graph Gr = R(G) = (Vr, Er).
1: process G by merging each strongly connected component in G into a single node;
2: set Vr = ∅, Er = ∅;
3: compute the reachability equivalence relation Re;
4: compute the partition Par = V/Re of G;
5: for each S ∈ Par do
6:   create a new node vs; Vr = Vr ∪ {vs};
7: sort E by edge order, ascending;
8: for each (v0, v1) ∈ E do
9:   if [v0]Re does not reach [v1]Re then
10:    Er = Er ∪ {([v0]Re, [v1]Re)};
11: return Gr = (Vr, Er);
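A runnable rendering of the whole procedure, in our own Python (names are ours; the input is assumed to already be a DAG, i.e., step 1's SCC merging is done beforehand):

```python
from collections import defaultdict

def optcomp_r(nodes, edges):
    """Sketch of optCompR on a DAG: nodes with identical ancestor
    and descendant sets form one hypernode; edges are then added in
    ascending edge order, and only when not already implied."""
    adj, radj = defaultdict(set), defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        radj[v].add(u)

    def closure(g):
        # all nodes reachable from x in g, via memoized DFS (DAG)
        memo = {}
        def walk(x):
            if x not in memo:
                memo[x] = set()
                acc = set()
                for y in g[x]:
                    acc |= {y} | walk(y)
                memo[x] = acc
            return memo[x]
        return {x: frozenset(walk(x)) for x in nodes}

    desc, anc = closure(adj), closure(radj)

    # step 2: merge nodes with identical ancestor/descendant sets
    classes = defaultdict(list)
    for x in nodes:
        classes[(anc[x], desc[x])].append(x)
    rep = {x: i for i, mem in enumerate(classes.values()) for x in mem}

    # topological rank: 0 for sinks, else 1 + max over children
    rank = {}
    def r(x):
        if x not in rank:
            rank[x] = 0 if not adj[x] else 1 + max(r(c) for c in adj[x])
        return rank[x]
    for x in nodes:
        r(x)

    # step 3: edge construction in ascending edge order
    cadj = defaultdict(set)
    def reaches(a, b):
        stack, seen = [a], set()
        while stack:
            x = stack.pop()
            if x == b:
                return True
            if x not in seen:
                seen.add(x)
                stack.extend(cadj[x])
        return False
    for u, v in sorted(edges, key=lambda e: (rank[e[0]], -rank[e[1]])):
        if rep[u] != rep[v] and not reaches(rep[u], rep[v]):
            cadj[rep[u]].add(rep[v])
    return rep, {h: set(c) for h, c in cadj.items()}
```

On the chain-with-shortcut c → d → e, c → e, the sorted order processes (d, e) and (c, d) first, so the redundant edge (c, e) is skipped by the reachability check.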

1. Merge strongly connected nodes (line 1). As in compressR, we first collapse each strongly connected component into one node. Since each node in an SCC can reach every other node in the same SCC, no reachability information is lost in the processed graph.
2. Merge reachability equivalence nodes (lines 2–6). Merge the nodes in the same reachability equivalence class of Re into a hypernode.
3. Edge construction (lines 7–10). Sort the edges in GSCC by the edge order, ascending. For each edge (u, v) in the sorted edge list, if [u]Re cannot reach [v]Re in the compressed graph, add an edge between [u]Re and [v]Re.

Time complexity: Algorithm optCompR indeed computes the compressed graph Gr in O(|V|(|V| + |E|)) time. Specifically, merging each strongly connected component into one node using Kosaraju's algorithm [34] takes O(|V| + |E|) time, Re and Par can be computed in O(|V|(|V| + |E|)) time, computing edge ranks takes O(|V| + |E|) time, sorting the edges is in O(|E|) time, and the edge construction of Gr can be done in O(|V|(|Vr| + |Er|)) time.

Property: The compressed graph generated by optCompR is optimum in terms of compression ratio for maintaining the reachability information of the graph.

Proof. We can prove the property by contradiction. If the compressed graph generated by optCompR is not optimum in terms of compression ratio, then at least one of the following happens: (1) the compressed graph has redundant nodes for preserving the reachability of the graph; (2) the compressed graph has redundant edges for preserving the reachability of the graph. If case (1) happens, there exists a reachability preserving compressed graph G′r that has fewer nodes than the compressed graph Gr generated by optCompR. In this case, there must exist a hypernode in G′r that contains two nodes u, v which have different ancestors/descendants. Assume a node x is an ancestor of u but not an ancestor of v; then the reachability of x


Fig. 3. Not all of the affected nodes can be identified by incRCM− .

and u is different from the reachability of x and v. Therefore, nodes u and v are not reachability equivalent, and cannot be merged into one hypernode in reachability preserving compression. For case (2), since the necessary edges are added before the redundant edges are checked, constructing edges according to the sorted edge list can be considered as a process of building a minimum spanning tree. Therefore, the compressed graph does not contain redundant edges. □

Example 4.1. In Fig. 2, the edge order of (d, e), (c, e), (c, d) in the subgraph {c, d, e} of graph G is o((d, e)) < o((c, d)) < o((c, e)). The edges (d, e) and (c, d) will be constructed before checking the reachability of node c and node e; therefore, the redundant edge (c, e) will not be added to the compressed graph, since node c can reach e already.

5. Incremental reachability graph compression

To cope with the dynamic nature of social networks and Web graphs, it is necessary to develop methods for compressed graph maintenance. For an original graph G and a set of updates ΔG to G, incremental compression algorithms calculate ΔGr based on ΔG and Gr to ensure P(Q′R(Gr ⊕ ΔGr)) = QR(G ⊕ ΔG), i.e., the post-processed reachability query results on the updated compressed graph are identical to the reachability query results on the updated graph. There is limited research on incremental query preserving compression, and the only existing incremental reachability preserving compression method is incRCM in [11]. Algorithm incRCM is composed of the incremental reachability preserving compression algorithm for edge insertion, incRCM+, and the algorithm for edge deletion, incRCM−. Nevertheless, from our observation, we found that incRCM has the following problems.

Firstly, incRCM− may return incorrect compression results. The reasons are as follows.

1. Not all of the affected nodes can be identified by incRCM−.
When an edge (v0, v1) is deleted, incRCM− only considers [v0]Re, [v1]Re, the parents of [v0]Re, the children of [v1]Re, and the nodes having the same parents as [v0]Re or [v1]Re. This might lead to incorrect compression results.

Example 5.1. Assume the original graph is G in Fig. 3(a), and the compressed graph Gr of G is as in Fig. 3(b). When the edge to be deleted is (a, b), only [a]Re, [b]Re, and [c]Re will be considered by incRCM−, and the updated compressed graph will be the same as Gr in Fig. 3(b). However, if we compress the updated G from scratch, we will find that the updated compressed graph should be the graph in Fig. 3(d); nodes d, e are also affected by the update. Obviously, the reachability relationship in Fig. 3(b) is different from that in Fig. 3(d), so the update result returned by incRCM− is incorrect.

2. In some cases, incremental reachability compression for edge deletion needs to add edges to the compressed graph to preserve the reachability relations. This process is not included in incRCM.

Example 5.2. Assume the graph is as G in Fig. 4(a); then the compressed graph Gr is as in Fig. 4(b). When edge (d, c) is deleted, we need to add an edge from a to d to maintain the reachability of the updated graph, and the updated compressed graph should be that in Fig. 4(d). incRCM will only delete the edge (d, c) in Gr, thus the returned result is incorrect.

Secondly, the compression results returned by incRCM+ may be suboptimal in terms of compression ratio. The reasons are as follows.

1. Not all of the affected nodes can be identified by incRCM in edge insertion. Assume the edge to be inserted is (u, u′); if r([u]Re) > r([u′]Re) in the updated graph, incRCM will merely merge with u the nodes sharing the same parents as [u]Re, and similarly for [u′]Re. However, from our study, we find that other nodes might be affected as well.


Fig. 4. Edge construction for preserving reachability.

Fig. 5. Not all of the affected nodes can be identified.

Fig. 6. Propagation of merging.

Example 5.3. Assume the original graph G is as in Fig. 5(a) and the edge to be inserted is (b, d). The compressed graph Gr = R(G) is as in Fig. 5(b). Node d is a leaf node in the updated G while node b is not, thus r([b]Re) > r([d]Re). Since there is no node sharing the same parents with [b]Re or [d]Re, the updated compressed graph should be identical to Gr according to incRCM. Nevertheless, the insertion of (b, d) makes node c have the same reachability property as node d, so c and d should be merged into one reachability equivalence class (as in Fig. 5(d)). In other words, node c is affected by the update but not identified by incRCM. As a consequence, the updated compressed graph is not identical to the one obtained by compressing the updated graph from scratch.

2. For an edge insertion, some of the ancestors/descendants of the two terminals should be merged.

Example 5.4. Assume G is as in Fig. 6(a); then the compressed graph of G is as in Fig. 6(b). When the edge to be inserted is (a, b), according to incRCM, there is no node in G sharing the same parents with a or b, thus the updated compressed graph is identical to Gr in Fig. 6(b). Nonetheless, the insertion of (a, b) makes the reachability properties of c the same as those of d, which means c and d should be merged into one hypernode in the updated compressed graph.

In this paper, we propose two novel incremental reachability preserving compression algorithms. The two algorithms are named incremental reachability preserving compression with optimum compression ratio and fast incremental reachability preserving compression.


5.1. Incremental reachability preserving compression with optimum compression ratio

In this section, we present the incremental reachability preserving compression with optimum compression ratio. The algorithm, denoted as incRPCO, returns an updated compressed graph which is optimum in terms of compression ratio. For a specific data graph, the updated compressed graph computed by incRPCO is identical to the one obtained by re-compressing the updated graph from scratch, i.e., Gr ⊕ ΔGr = R(G ⊕ ΔG). Firstly, we introduce two lemmas which are the basis of incRPCO.

Lemma 2. If deleting an edge (v0, v1) in a directed acyclic graph makes some nodes in a reachability equivalence class reachable by v0 while other nodes in the same class are not, then the nodes reachable by v0 are children of v0.

Proof. The compressed graph is a directed acyclic graph. There are two types of hypernodes in this compressed graph: (1) hypernodes that cannot be reached by v0; (2) hypernodes that can be reached by v0. The hypernodes that cannot be reached by v0 cannot be reached by v1 either. Therefore, deleting (v0, v1) will not affect the reachability of nodes in the hypernodes that cannot be reached by v0. That means if a hypernode [v]Re needs to be split after deleting (v0, v1), it must be reachable by v0. After the deletion, if v0 can reach some nodes in [v]Re and cannot reach the others, then the nodes that can be reached by v0 must be directly connected to v0 (i.e., they are children of v0). If there were a node v′ that can be reached by v0 but is not directly connected to v0, then there would be at least one node on the path p(v0, v′); we use y to represent this node. Since y can reach v′, y can reach all nodes in the same equivalence class as v′ (i.e., [v]Re). In this case, after deleting (v0, v1), v0 can reach all nodes in [v]Re through y, which is contrary to the assumption. □

Lemma 3.
If deleting an edge (v0, v1) in a directed acyclic graph makes some nodes in a reachability equivalence class able to reach v1 while other nodes in the same reachability equivalence class cannot, then all the nodes that can reach v1 are parents of v1.

Proof. The compressed graph is a directed acyclic graph. There are two types of hypernodes in this compressed graph: (1) hypernodes that can reach v1; (2) hypernodes that cannot reach v1. The hypernodes that cannot reach v1 cannot reach v0 either. Therefore, deleting (v0, v1) will not affect the reachability of the nodes in the hypernodes that cannot reach v1. That means if a hypernode [v]Re needs to be split after deleting (v0, v1), it must be able to reach v1. After the deletion, if some nodes in [v]Re can reach v1 while other nodes cannot, then any node v′ that can reach v1 must be directly connected to v1 (i.e., v′ is a parent of v1). This is because if there were a node y on the path p(v′, v1), v′ could reach y, meaning that all nodes in the same equivalence class as v′ (i.e., [v]Re) could reach y. In this case, after deleting (v0, v1), all nodes in [v]Re could reach v1 through y, which is contrary to the assumption. □

As in [11], we first merge each strongly connected component in G into one node. Since each node in an SCC can reach every other node in the same SCC, the reachability relations in G are not changed by this process. The processed G is a directed acyclic graph. The main idea of incRPCO is to identify the affected area and the areas related to the affected area, and then update the hypernodes and the edges between the hypernodes in the affected area according to the information from the relevant areas. From Lemmas 2 and 3, we know that edge deletion only affects the reachability equivalence classes of the two terminals within one hop. Assume the edge to be updated is (v0, v1); we define three areas as follows.
• Reachability Affected Area (RAA): RAA consists of nodes v0, v1, the parent nodes of v0 (denoted as P(v0)), the children of v1 (denoted as C(v1)), and the edges between these nodes. The update only affects the reachability of nodes in RAA.
• Equivalence Class Related Area (ERA): ERA consists of the nodes in [v0]Re and [v1]Re, the child nodes of [v0]Re (denoted by C([v0]Re)), and the parent nodes of [v1]Re (denoted by P([v1]Re)), except the nodes in RAA. The update does not affect the reachability of the nodes in ERA, but some equivalence classes in the ERA need to be reassigned; otherwise, the updated compressed graph may contain redundant nodes.
• Propagation Area (PPA): PPA consists of the ancestors and descendants of the nodes in RAA and ERA. PPA is the area related to ERA and RAA in the update.

Fig. 7 is an example of the three areas. According to Lemmas 2 and 3, only the nodes in RAA affect the reachability relations, but we need ERA and PPA to reassign the nodes in ERA to their reachability equivalence classes. Therefore, in the updated compressed graph, only the nodes and edges in RAA will be updated; those of ERA and PPA will be unchanged. The process of incRPCO is listed in Algorithm 2. It can be divided into three steps.

1. Preprocessing (lines 2–5). The algorithm first identifies whether the update is redundant.
2. Auxiliary graph construction (lines 8–25). It then constructs an auxiliary graph which consists of the affected area and the relevant areas. Intuitively, we can identify the RAA, ERA and PPA from the updated G. As mentioned above, the reachability of nodes in ERA and PPA is not affected by the update. Since the reachability of Gr is identical to that of G, we can use the ERA and PPA in Gr instead of those in G to make the auxiliary graph smaller.
3. Updating (lines 27–41). Compress the auxiliary graph by optCompR, and then replace the affected subgraph in Gr with the compressed auxiliary graph.
Time complexity: The algorithm incRPCO runs in O(|V|(|V| + |E|)) time. Specifically, constructing the auxiliary graph takes O(|ΔG||V| + |E|) time, and compressing the auxiliary graph takes at most O(|V|(|V| + |E|)) time.


Y. Liang, C. chen and Y. Wang et al. / Information Sciences 520 (2020) 232–249

Fig. 7. Auxiliary graph.

5.2. Fast reachability preserving compression

The algorithm incRPCO generates an updated reachability preserving compressed graph identical to the one obtained by re-compressing the updated graph from scratch, and it is optimal in terms of compression ratio. However, sometimes we only require the updated graph to return correct reachability query results, regardless of whether the updated compressed graph has an optimal compression ratio. We therefore further design a fast incremental reachability preserving compression algorithm, denoted incRPCF, which only updates the parts that affect the reachability query results. In incRPCF, we design incremental compression for edge insertion and edge deletion separately.

5.2.1. Edge insertion

We first define fast reachability preserving compression for edge insertion. The algorithm is denoted incRPCF+, and the process is listed in Algorithm 3. Assuming the edge to be inserted is (v0, v1), the process of incRPCF+ can be divided into the following three steps.
1. Preprocessing (lines 1–2). Preprocessing determines whether the inserted edge affects the reachability relations of the compressed graph.
2. Node splitting (lines 3–5). If the update is not redundant, there was no path from [v0]Re to [v1]Re before the insertion. In this case, the reachability relations of nodes v0 and v1 differ from those of the other nodes in [v0]Re and [v1]Re after the update, so we need to split v0 and v1 from their original reachability equivalence classes.
3. Edge construction (lines 6–12). Construct edges for the newly formed hypernodes of v0 and v1.

Time complexity: The time complexity of incRPCF+ is O(|Vr| + |Er|) for a single edge, and O(|ΔG|(|Vr| + |Er|)) for a batch of edges ΔG.

5.2.2. Edge deletion

Different from edge insertion, edge deletion might break the existing graph topology. Edge deletions can be divided into two cases according to whether the deletion destroys a strongly connected component. From Fig. 4, we know that when edges are deleted from the original graph, the update of the compressed graph may involve edge-adding operations. From our study, we find the following lemma about these edge-adding operations.

Lemma 4. After deleting edge (v0, v1), if any edge needs to be added to the compressed graph to preserve node reachabilities, the start of the edge must be an ancestor of v0, and the end of the edge must be a descendant of v1.

Proof. Assume the deleted edge is (v0, v1); only the reachability of the nodes in A(v0) and D(v1) can be affected. Edge construction is required only when nodes in A(v0) can reach D(v1) in the updated G but not in the updated Gr. Hence, if any edge is to be added, its start must be in A(v0) and its end must be in D(v1). □

We design fast incremental reachability preserving compression for edge deletion based on Lemma 4. The algorithm is denoted incRPCF−, and the process is listed in Algorithm 4. The process can be divided into the following three steps.
1. Preprocessing (lines 2–3). Preprocessing identifies whether the deletion affects the reachability relations in the compressed graph.
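Lemma 4 narrows the search space for compensating edges to pairs drawn from the ancestors of v0 and the descendants of v1. A brute-force sketch of that candidate set (the helper names and the BFS-based reachability are our own, not the paper's implementation):

```python
from collections import deque

def reachable_from(adj, src):
    """All nodes reachable from src via BFS (src itself excluded)."""
    seen, q = set(), deque([src])
    while q:
        u = q.popleft()
        for w in adj.get(u, ()):
            if w not in seen:
                seen.add(w)
                q.append(w)
    return seen

def candidate_edges(adj, v0, v1):
    """Per Lemma 4: after deleting (v0, v1), only pairs (a, d) with a an
    ancestor of v0 (or v0 itself) and d a descendant of v1 (or v1 itself)
    can need a compensating edge in the compressed graph."""
    radj = {}
    for u, ws in adj.items():          # reverse adjacency for ancestors
        for w in ws:
            radj.setdefault(w, set()).add(u)
    ancestors = reachable_from(radj, v0) | {v0}
    descendants = reachable_from(adj, v1) | {v1}
    return {(a, d) for a in ancestors for d in descendants}
```

On the chain 0→1→2→3, deleting (1, 2) limits candidate edges to the four pairs from {0, 1} to {2, 3}, rather than all node pairs.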


Algorithm 2: Algorithm incRPCO.
Input: original graph G = (V, E), compressed graph Gr = (Vr, Er), update subgraph ΔG
Output: updated Gr
 1: for each updated edge (v0, v1) ∈ ΔG do
 2:   if (v0, v1) is to be added and [v0]Re can reach [v1]Re in Gr then
 3:     ΔG = ΔG − (v0, v1);
 4:   else if (v0, v1) is to be deleted and v0 can reach v1 in the updated G then
 5:     ΔG = ΔG − (v0, v1);
 6: VRAA, VERA, VPPA = ∅;
 7: for each (v0, v1) ∈ ΔG do
 8:   // identify the nodes in RAA
 9:   for each v ∈ ({v0} ∪ {v1} ∪ P(v0) ∪ C(v1)) do
10:     if v ∉ VRAA then
11:       VRAA = VRAA ∪ {v};
12:   // identify the nodes in ERA
13:   for each v ∈ ({[v0]Re} ∪ {[v1]Re} ∪ C([v0]Re) ∪ P([v1]Re)) do
14:     if v ∉ VRAA and v ∉ VERA then
15:       VERA = VERA ∪ {v};
16: // identify the nodes in PPA
17: for each v ∈ (VRAA ∪ VERA) do
18:   VPPA = VPPA ∪ A(v) ∪ D(v);
19: // build the auxiliary graph
20: VA, EA = ∅;
21: VA = VRAA ∪ VERA ∪ VPPA;
22: for each (u, v) ∈ E do
23:   if u ∈ VA and v ∈ VA then
24:     EA = EA ∪ {(u, v)};
25: GA = (VA, EA);
26: // construct the compressed graph
27: GrA = R(GA);
28: Vr′, Er′ = ∅;
29: for each hypernode v ∈ Vr do
30:   if hypernode v does not contain any node in VA then
31:     Vr′ = Vr′ ∪ {v};
32: for each (u, v) ∈ Er do
33:   if u ∈ Vr′ and v ∈ Vr′ then
34:     Er′ = Er′ ∪ {(u, v)};
35: for each (u, v) ∈ Er do
36:   if u ∈ VrA and v ∈ Vr′ then
37:     Er′ = Er′ ∪ {(u, v)};
38:   else if v ∈ VrA and u ∈ Vr′ then
39:     Er′ = Er′ ∪ {(u, v)};
40: Vr′ = Vr′ ∪ VrA;
41: Er′ = Er′ ∪ ErA;
42: return updated Gr = (Vr′, Er′);
2. Node splitting (lines 4–5). If the deletion is not redundant, the reachability relations of the two terminals will differ from those of the other nodes in the same reachability equivalence class. Hence, we need to split the two terminals from their original hypernodes.
3. Edge construction (lines 7–26). Edge construction adds edges according to Lemma 4. It is worth noting that whether the deletion destroys an SCC affects the edge construction process. If the deletion destroys an SCC, i.e., the two terminals of the deleted edge are in the same SCC, we need to consider all of the nodes in the SCC during edge construction.

Time complexity: For deleting an edge (v0, v1), if v0 and v1 are not in the same SCC, let K denote the number of edges from A(v0); then the time complexity of incRPCF− is O(K(|D(v1)| + |E|)). If v0 and v1 are in the same SCC, let M and N denote the numbers of edges from A([v0]SCC) and C([v0]SCC), respectively; then the time complexity of incRPCF− is O(M(N + |Er|)).

Example 5.5. An example of incRPCF is shown in Fig. 8. Suppose the original graph is G in Fig. 8(a), and GSCC is the graph generated by merging each strongly connected component of G into one node, shown in Fig. 8(b); the compressed graph Gr of G is shown in Fig. 8(c). When (h, e) is to be


Algorithm 3: Algorithm incRPCF+.
Input: a compressed graph Gr = (Vr, Er), an edge to be inserted (v0, v1)
Output: an updated compressed graph Gr
 1: if [v0]Re can reach [v1]Re then
 2:   return Gr = (Vr, Er)
 3: else
 4:   Vr = nodeSplit(Vr, v0);
 5:   Vr = nodeSplit(Vr, v1);
 6: for each u ∈ {v0, v1} do
 7:   for each v ∈ C([u]Re) do
 8:     Er = Er ∪ {(u, v)}
 9:   for each w ∈ P([u]Re) do
10:     Er = Er ∪ {(w, u)}
11: Er = Er ∪ {(v0, v1)}
12: return Gr = (Vr, Er)

procedure nodeSplit(Vr, v)
 1: if |[v]Re| = 1 then
 2:   return Vr
 3: else
 4:   Vr = Vr − [v]Re
 5:   [vrest]Re = [v]Re − {v}
 6:   create a new hypernode [v]Re = {v}
 7:   Vr = Vr ∪ [vrest]Re
 8:   Vr = Vr ∪ [v]Re
 9: return Vr
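The nodeSplit procedure translates almost directly to Python if the set of hypernodes is represented as a set of frozensets of original nodes (a representation we choose for illustration; the paper does not specify one):

```python
def node_split(hypernodes, v):
    """Split node v out of its reachability equivalence class.
    `hypernodes` is a set of frozensets of original nodes; if v's class
    is already a singleton, nothing changes (mirrors the nodeSplit
    procedure of Algorithm 3)."""
    v_class = next(h for h in hypernodes if v in h)
    if len(v_class) == 1:
        return hypernodes
    rest = v_class - {v}                  # [v_rest]Re = [v]Re − {v}
    return (hypernodes - {v_class}) | {rest, frozenset({v})}
```

Splitting node 2 out of the class {1, 2, 3} leaves {1, 3} and the new singleton {2}; splitting an already-singleton class is a no-op.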

Algorithm 4: Algorithm incRPCF−.
Input: G = (V, E), Gr = (Vr, Er), an edge to be deleted (v0, v1), the strongly connected components in G
Output: updated Gr
 1: E = E − {(v0, v1)}
 2: if [v0]Re can still reach [v1]Re in G then
 3:   return Gr = (Vr, Er)
 4: Vr = nodeSplit(Vr, v0)
 5: Vr = nodeSplit(Vr, v1)
 6: Er = Er − {(v0, v1)}
 7: if v0 and v1 are not in the same SCC then
 8:   for each v ∈ A(v0) do
 9:     for each u ∈ C(v) do
10:       if u ∈ D(v1) and v can reach u in G then
11:         if [v]Re cannot reach [u]Re in Gr then
12:           Er = Er ∪ {([v]Re, [u]Re)}
13:   return Gr = (Vr, Er)
14: else  // v0 and v1 are in the same SCC
15:   if [v0]Re contains a node other than [v0]SCC then
16:     return Gr = (Vr, Er)
17:   A = ∅, D = ∅
18:   for each v ∈ [v0]SCC do
19:     A = A ∪ A(v)
20:     D = D ∪ D(v)
21:   for each v ∈ A do
22:     for each u ∈ C(v) do
23:       if u ∈ D and v can reach u in G then
24:         if [v]Re cannot reach [u]Re in Gr then
25:           Er = Er ∪ {([v]Re, [u]Re)}
26: return Gr = (Vr, Er)


Fig. 8. Example of incRPCF .

deleted, it is obviously not a redundant update: we need to split node hSCC and node eSCC from their original hypernodes and delete ([h]Re, [e]Re) in the compressed graph. Moreover, we need to add the edges ([hSCC]Re, [a]Re) and ([hSCC]Re, [b]Re) in the edge construction step. Note that the reachability relations of Fig. 8(d) are identical to those of the last graph in Fig. 8(e), but the last graph in Fig. 8(e) has more nodes and edges.

6. Experiment

In the experiments, we used real-life graphs of different sizes and topologies as our datasets. We evaluate the performance of OptCompR in terms of compression ratio and compression time, and the performance of incRPCO and incRPCF in terms of update time and query processing time. All algorithms were implemented in Python, and the experiments were run on a Linux machine powered by two Intel(R) Xeon(R) E5640 CPUs and 48 GB of memory.

6.1. Datasets

The datasets used in our experiments are listed in Table 2. The descriptions of the datasets are as follows.
• xmark [38]: XML documents.
• Simple Wiki-Web²: A graph from Wikipedia. A node in the graph represents an entry in Wikipedia, and an edge represents a hyperlink between two entries. We use the records from September 10, 2001 to January 1, 2006 to

² https://dumps.wikimedia.org/.

construct the original graph, and then use the changes in the records from January 1, 2006 to January 10, 2006 as the graph updates.
• Wiki-Vote [17]: Wikipedia administrator elections and vote history data according to a dump of Wikipedia.
• p2p-Gnutella31 [16]: A sequence of snapshots of the Gnutella peer-to-peer file-sharing network. A node in the graph represents a host in the network, and an edge represents a connection between two hosts.
• Cit-HepTh [17]: A citation graph from arXiv. Each paper in arXiv is represented as a node, and if a paper i in arXiv cites paper j in arXiv, the graph contains a directed edge from i to j.
• web-Stanford [17]: Webpages from Stanford University (stanford.edu). A node in the graph represents a webpage, and an edge represents a hyperlink between two webpages.

Table 2. Datasets.

dataset            |V|       |E|        Size   Topology
xmark              6080      7025       Small  Sparse
Simple Wiki-Web    17,872    87,931     Small  Sparse
Wiki-Vote          7115      103,689    Small  Dense
p2p-Gnutella31     62,586    147,892    Large  Sparse
Cit-HepTh          27,770    352,807    Large  Dense
web-Stanford       281,903   2,312,497  Large  Dense

Table 3. Compression ratio comparison.

                   compressR            OptCompR
Dataset            |Vr|      |Er|       |Vr|      |Er|
xmark              3392      4073       3392      4002
Simple Wiki-Web    2276      2820       2276      2421
Wiki-Vote          1016      1664       1016      1094
p2p-Gnutella31     4866      7728       4866      5666
Cit-HepTh          18,820    103,070    18,820    37,118
web-Stanford       5368      8201       5368      6226

6.2. Experiment of reachability preserving compression

In this section, we evaluate the performance of OptCompR in terms of compression ratio and compression time. The algorithms compressR [11] and DAG-reduction [43] are used as the baselines. The graph compression ratio is the size of the compressed graph divided by the size of the original graph:

RC = (|Vr| + |Er|) / (|V| + |E|)    (1)
If a reachability graph compression algorithm is optimal in terms of compression ratio, the compressed graph it generates has the minimal number of nodes and edges needed to preserve the reachability information of the original graph. The compressed graphs generated by OptCompR and DAG-reduction are identical and have the optimal compression ratio, while the compressed graphs generated by compressR contain redundant edges. Therefore, in our experiments, only the compression ratios of the algorithms compressR and OptCompR are compared, and the results are shown in Table 3.³ The main difference between OptCompR and compressR is that OptCompR uses our proposed edge sorting scheme to sort the edges before constructing the edges of the compressed graph, while compressR constructs the edges of the compressed graph in a random order. From Table 3, we have the following observations.

1. As expected, the compressed graphs generated by OptCompR have fewer edges than those generated by compressR, especially when the original graph is dense. For the dense graphs Wiki-Vote, Cit-HepTh, and web-Stanford, the numbers of edges in the graphs compressed with OptCompR are 34.25%, 63.99%, and 24.08% smaller, respectively, than in the graphs compressed with compressR. This is because in a dense graph there are usually multiple paths between two nodes; in this case, if the construction order of edges is not considered when generating the compressed graph, redundant edges are likely to be generated.

2. On the xmark dataset, the difference between the graph compressed with OptCompR and the graph compressed with compressR is not significant. This is because xmark is a sparse dataset with only 6080 nodes and 7025 edges; the average node degree is only 1.16. This means that there are very few paths between two connected nodes in the graph, so the construction order of edges has limited effect on the compression result.
³ The results generated by our implementation of compressR differ from those reported in [11] on some datasets. Our code is available at https://github.com/PapersCode/Reachability.

Table 4. Compression time comparison.

                   Compression time (s)
Dataset            OptCompR   DAG-reduction
xmark              2.24       15.65
Simple Wiki-Web    47.93      263.02
Wiki-Vote          11.84      5.58
p2p-Gnutella31     65.08      3546.98
Cit-HepTh          265.79     960.53
web-Stanford       335.46     685.79

3. In the process of graph compression, the order of edge construction has a more significant impact on large graphs than on small graphs. For example, the average vertex degrees of Wiki-Vote and Cit-HepTh are similar. When we compress Cit-HepTh, using OptCompR reduces the edges by 63.99% compared to using compressR, whereas on Wiki-Vote the number of edges of the compressed graph generated by OptCompR is only 34.25% smaller than that of the graph compressed with compressR. This is because when a large graph and a small graph have the same average vertex degree, the average number of paths between two nodes is larger on the large graph; in this case, the effect of sorting the edges before constructing the edges of the compressed graph is more significant.

In the compression time comparison, we evaluated the time to compress the graphs with OptCompR and DAG-reduction. The results are listed in Table 4,⁴ and our findings are as follows.

1. Although the compression results of OptCompR and DAG-reduction are the same, OptCompR is remarkably faster than DAG-reduction on most of the graphs, especially on sparse graphs. This is because DAG-reduction generates a compressed graph by computing a transitive reduction of the original graph followed by an equivalence reduction. When the original graph is sparse, the transitive reduction cannot significantly reduce the size of the graph but incurs extra cost.

2. An interesting phenomenon is that OptCompR is slower than DAG-reduction on the dataset Wiki-Vote. This is because Wiki-Vote is a voting network in which 66.51% of the nodes have no incoming edges and 14.13% of the nodes have no outgoing edges. In this case, computing the transitive reduction is fast and can significantly reduce the size of the graph.

6.3. Experiment of incremental reachability preserving compression

In the experiment on incremental reachability preserving compression, we compared the performance of incRPCO, incRPCF, and re-compressing the updated graph from scratch with OptCompR on four datasets of different sizes and topologies. Concretely, we used the small-dense graph Wiki-Vote, the large-sparse graph p2p-Gnutella31, and the large-dense graph Cit-HepTh as our experimental data. These three graphs are static graphs; in the experiment, we randomly added or deleted some edges in the original graph, and then updated the compressed graph. The strengths of incRPCO and incRPCF are different: incRPCO is slower than incRPCF, but the updated graph has no redundant edges. Therefore, we evaluated the performance of the algorithms from different aspects, namely the update time, the size of the updated compressed graph, and the query processing time on the compressed graph. We further conducted our experiment on a real dynamic graph, Simple Wiki-Web, using one day as the update interval, and measured the performance of the different algorithms.

6.3.1. Update time

The comparison of update times is shown in Fig. 9. From the results, we have the following observations.

1. The algorithm incRPCF is significantly faster than incRPCO and than re-compressing the updated graph from scratch with OptCompR. The experiment on the large-dense dataset Cit-HepTh shows that updating 30 edges with incRPCF takes several seconds, while incRPCO or re-compressing the graph from scratch with OptCompR needs over 1000 seconds. incRPCF is fast because it only focuses on the changes that directly affect node reachability.

2. The update time of incRPCO is less than that of re-compressing from scratch only when a few edges are updated. When many edges are to be updated, incRPCO can even be slower than re-compressing the graph from scratch. For example, in Fig.

⁴ We used linear-ER in [43] for the ER part in DAG-reduction.


Fig. 9. Update time comparison.

3. We notice that the update time of incRPCO is not linear in the percentage of edge changes. In fact, the topology around the randomly selected edges has a high impact on the update time. This is because the update time of incRPCO is mainly determined by the size of the auxiliary graph, and the most time-consuming part of incRPCO is compressing the auxiliary graph. Therefore, if the updated edges affect the reachability of many nodes in the original graph, the corresponding auxiliary graph will be large even if only a few edges are updated. Conversely, if the updated edges have little effect on the reachability of most nodes in the original graph, the auxiliary graph will be small even if many edges are updated.

6.3.2. Graph size

Since the compressed graph generated by incRPCO is identical to that obtained by re-compressing the updated graph from scratch with OptCompR, we only recorded the compression results of incRPCO and incRPCF in Table 5. Due to space limitations, we only list some representative results in the table. From the table, we make the following observations.

1. The updated compressed graphs generated by incRPCO and incRPCF do not differ much when a few edges are changed, but the difference becomes significant as the number of changed edges increases. For example, on the graph p2p-Gnutella31, when 0.001% of the edges are updated, the numbers of nodes and edges in the graph computed by incRPCO and in that computed by incRPCF are very close. However, when the percentage of updated edges increases to 0.009%, the graph updated with incRPCO has 16.45% fewer nodes and 51.75% fewer edges than the graph updated with incRPCF.

2. We notice that the update results of incRPCO and incRPCF are very similar on the graph Wiki-Vote. This is because Wiki-Vote is a voting network where most of the nodes have only outgoing edges. Therefore, the update of most


Table 5. Compressed graph size comparison.

                                     incRPCO            incRPCF
Dataset           |ΔE|%              |Vr|     |Er|      |Vr|     |Er|
Simple Wiki-Web   0.25% (Day 1)      2268     2406      2322     2485
                  2.75% (Day 10)     2274     2393      2790     3476
Wiki-Vote         0.016%             1016     1094      1016     1094
                  0.056%             1016     1094      1017     1096
p2p-Gnutella31    0.001%             4866     5666      4867     5668
                  0.009%             4868     5668      5669     8601
Cit-HepTh         0.001%             18,820   37,117    18,822   37,121
                  0.007%             18,822   37,119    18,852   44,159

Fig. 10. Query processing time comparison.

edges in the original graph will only affect their terminals. In this case, updating the graph with incRPCO and with incRPCF gives similar results.

3. Another interesting phenomenon is that the graphs updated with incRPCO and incRPCF differ more significantly on the two sparse graphs Simple Wiki-Web and p2p-Gnutella31 than on the two dense graphs Wiki-Vote and Cit-HepTh. This may be because adding or deleting edges in a dense graph often does not affect the reachability of most nodes in the graph.


6.3.3. Query processing time

The comparison of reachability query processing times on the compressed graphs is shown in Fig. 10. From the figure, we have the following findings.

1. Querying on the compressed graphs is significantly faster than querying on the original graph. This is because query processing time is related to the size of the graph to be queried, and the updated compressed graphs are much smaller than the original graphs.

2. Comparing the query processing times of the graphs updated by incRPCO and incRPCF, the difference is more significant on the sparse graphs (i.e., Simple Wiki-Web and p2p-Gnutella31) than on the dense graphs (i.e., Wiki-Vote and Cit-HepTh). This is consistent with the results in Table 5.

7. Conclusion

There is limited research on query preserving compression. From our study, we found that the existing reachability preserving compression algorithm compressR constructs the edges of the compressed graph in a random order. As a result, the graphs compressed with compressR may contain redundant edges, and compressing a graph multiple times may produce different compression results, which may cause problems in maintaining the graphs in real life. For incremental query preserving compression, we found that the existing incremental query preserving compression algorithm incRCM may return incorrect compression results in several cases. In this paper, we improve compressR by introducing an edge sorting scheme. In addition, we propose two novel incremental reachability preserving compression algorithms, namely incRPCO and incRPCF. The updated compressed graph generated by incRPCO is identical to the compressed graph generated by re-compressing the updated graph from scratch. On the other hand, incRPCF is very fast because it only focuses on updating the parts that affect node reachability. We conducted experiments on several public datasets, and the results demonstrate the effectiveness of our approaches.
In the future, we will investigate query preserving compression methods for more query classes.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Yuzhi Liang: Supervision, Formal analysis, Writing - original draft, Methodology, Conceptualization. Chen Chen: Conceptualization, Methodology, Software, Data curation, Investigation, Formal analysis. Yukun Wang: Methodology, Software, Validation, Data curation, Investigation. Kai Lei: Resources, Project administration, Funding acquisition. Min Yang: Visualization, Writing - review & editing. Ziyu Lyu: Visualization, Writing - review & editing.

Acknowledgment

This work was financially supported by the Shenzhen Key Laboratory Project (ZDSYS201802051831427) and the project "PCL Future Regional Network Facilities for Large-scale Experiments and Applications".

References

[1] Y. Abdelsadek, K. Chelghoum, F. Herrmann, I. Kacem, B. Otjacques, Community extraction and visualization in social networks applied to Twitter, Inf. Sci. 424 (2018) 204–223.
[2] F.M. Bencic, I.P. Zarko, Distributed ledger technology: blockchain compared to directed acyclic graph, 2018, pp. 1569–1570.
[3] P. Boldi, S. Vigna, The webgraph framework I: compression techniques, in: Proceedings of the 13th International Conference on World Wide Web, ACM, 2004, pp. 595–602.
[4] C. Borgs, M. Brautbar, J. Chayes, B. Lucier, Maximizing social influence in nearly optimal time, Data Struct. Algorithms (2012).
[5] R. Bramandia, B. Choi, W.K. Ng, Incremental maintenance of 2-hop labeling of large graphs, IEEE Trans. Knowl. Data Eng. 22 (5) (2009) 682–698.
[6] G.S. Brodal, R. Fagerberg, Dynamic representations of sparse graphs, in: Workshop on Algorithms and Data Structures, Springer, 1999, pp. 342–351.
[7] Q. Cai, M. Gong, L. Ma, S. Ruan, F. Yuan, L. Jiao, Greedy discrete particle swarm optimization for large-scale social network clustering, Inf. Sci. 316 (2015) 503–516.
[8] C. Demetrescu, G.F. Italiano, Dynamic shortest paths and transitive closure: algorithmic techniques and data structures, J. Discrete Algorithms 4 (3) (2006) 353–383.
[9] L. Dhulipala, I. Kabiljo, B. Karrer, G. Ottaviano, S. Pupyrev, A. Shalita, Compressing graphs and indexes with recursive graph bisection, in: ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2016, pp. 1535–1544.
[10] D. Eppstein, G.F. Italiano, R. Tamassia, R.E. Tarjan, J.R. Westbrook, M. Yung, Maintenance of a minimum spanning forest in a dynamic planar graph, 1990, pp. 1–11.
[11] W. Fan, J. Li, W. Xin, Y. Wu, Query preserving graph compression, in: International Conference on Management of Data, 2012, pp. 157–168.
[12] M.R. Henzinger, V. King, Fully dynamic biconnectivity and transitive closure, in: Proceedings of IEEE 36th Annual Foundations of Computer Science, IEEE, 1995, pp. 664–672.
[13] J. Iverson, G. Karypis, Storing dynamic graphs: speed vs. storage trade-offs, 2014.
[14] A. Khan, C. Aggarwal, Query-friendly compression of graph streams, in: IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 2016, pp. 130–137.
[15] A. Khan, C. Aggarwal, Toward query-friendly compression of rapid graph streams, Soc. Netw. Anal. Min. 7 (1) (2017) 23.

[16] J. Leskovec, J. Kleinberg, C. Faloutsos, Graph evolution: densification and shrinking diameters, ACM Trans. Knowl. Discov. Data (TKDD) 1 (1) (2007) 2.
[17] J. Leskovec, A. Krevl, SNAP datasets: Stanford large network dataset collection, 2014, http://snap.stanford.edu/data.
[18] M. Li, H. Gao, Z. Zou, Maximum Steiner connected k-core query processing based on graph compression, J. Softw. 27 (9) (2016) 2265–2277.
[19] P. Liakos, K. Papakonstantinopoulou, A. Delis, Memory-optimized distributed graph processing through novel compression techniques, in: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, ACM, 2016, pp. 2317–2322.
[20] P. Liakos, K. Papakonstantinopoulou, M. Sioutis, Pushing the envelope in graph compression, in: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, ACM, 2014, pp. 1549–1558.
[21] S. Maneth, F. Peternek, Compressing graphs by grammars, in: IEEE International Conference on Data Engineering, 2016.
[22] H. Maserrat, J. Pei, Neighbor query friendly compression of social networks, in: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2010, pp. 533–542.
[23] L. Ming, K-reach query processing based on graph compression, J. Softw. (2014).
[24] L. Nie, X. Song, T.-S. Chua, Learning from Multiple Social Networks, Synthesis Lectures on Information Concepts, Retrieval, and Services, 2016.
[25] L. Nie, M. Wang, Z. Zha, T. Chua, Oracle in image search: a content-based approach to performance prediction, ACM Trans. Inf. Syst. 30 (2) (2012) 13.
[26] N. Ohsaka, T. Akiba, Y. Yoshida, K. Kawarabayashi, Dynamic influence analysis in evolving networks, in: Very Large Data Bases, 9, 2016, pp. 1077–1088.
[27] D.U. Phuong-Hanh, H.D. Pham, N.H. Nguyen, Optimizing the shortest path query on large-scale dynamic directed graph, in: IEEE/ACM International Conference on Big Data Computing, 2016.
[28] S. Raghavan, H. Garcia-Molina, Representing web graphs, in: Proceedings 19th International Conference on Data Engineering, IEEE, 2003, pp. 405–416.
[29] Z. Raghebi, F. Banaei-Kashani, Reach me if you can: reachability query in uncertain contact networks, in: Proceedings of the Fifth International ACM SIGMOD Workshop on Managing and Mining Enriched Geo-Spatial Data, ACM, 2018, pp. 19–24.
[30] L. Roditty, U. Zwick, A fully dynamic reachability algorithm for directed graphs with an almost linear update time, SIAM J. Comput. 45 (3) (2016) 712–733.
[31] A. Sadri, F.D. Salim, Y. Ren, M. Zameni, J. Chan, T. Sellis, Shrink: distance preserving graph compression, Inf. Syst. (2017).
[32] D. Sangiorgi, On the origins of bisimulation and coinduction, ACM Trans. Program. Lang. Syst. 31 (4) (2009) 15.
[33] N. Sengupta, N. Kasabov, Spike-time encoding as a data compression technique for pattern recognition of temporal data, Inf. Sci. 406 (2017) 133–145.
[34] M. Sharir, A strong-connectivity algorithm and its applications in data flow analysis, Comput. Math. Appl. 7 (1) (1981) 67–72.
[35] R.E. Tarjan, Depth-first search and linear graph algorithms, SIAM J. Comput. 1 (2) (1972) 146–160.
[36] H. Wei, J.X. Yu, C. Lu, R. Jin, Reachability querying: an independent permutation labeling approach, VLDB J. 27 (1) (2018) 1–26.
[37] X. Xiao, R. Li, H. Zheng, R. Ye, A. Kumarsangaiah, S. Xia, Novel dynamic multiple classification system for network traffic, Inf. Sci. 479 (2019) 526–541.
[38] H. Yildirim, V. Chaoji, M.J. Zaki, GRAIL: scalable reachability index for large graphs, VLDB Endowment, 2010.
[39] H. Yildirim, V. Chaoji, M.J. Zaki, Dagger: a scalable index for reachability queries in large dynamic graphs, arXiv:1301.0977, 2013.
[40] J. Zhang, K. Zhu, Y. Pei, G.H.L. Fletcher, M. Pechenizkiy, Cluster-preserving sampling from fully-dynamic streaming graphs, Inf. Sci. 482 (2019) 279–300.
[41] L. Zhang, C. Xu, W. Qian, A. Zhou, Common neighbor query-friendly triangulation-based large-scale graph compression, 2014, pp. 234–243.
[42] J. Zhou, J.X. Yu, N. Li, H. Wei, Z. Chen, X. Tang, Accelerating reachability query processing based on DAG reduction, VLDB J. 27 (2) (2018) 271–296.
[43] J. Zhou, S. Zhou, J.X. Yu, Z. Chen, Z. Chen, X. Tang, DAG reduction: fast answering reachability queries, in: ACM International Conference on Management of Data, 2017, pp. 375–390.