Future Generation Computer Systems 101 (2019) 1187–1200

Placing big graph into cloud for parallel processing with a two-phase community-aware approach

Kekun Hu, Guosun Zeng



Department of Computer Science and Technology, Tongji University, Shanghai, 201804, China
Tongji Branch, National Engineering & Technology Center of High Performance Computer, Shanghai, 201804, China

Highlights

• A two-phase community-aware data placement approach is developed.
• A streaming heuristic for obtaining an initial placement scheme is proposed.
• A constrained kernel k-means algorithm for optimizing the data placement is proposed.
• Properties of the proposed approach are analyzed and two optimizations are presented.
• The performance and flexibility of our approach are extensively demonstrated.

Article info

Article history: Received 2 February 2019; Received in revised form 20 May 2019; Accepted 9 July 2019; Available online 17 July 2019.

Keywords: Cloud computing; Big graph processing; Data placement; Community detection; Scale constraints; Modularity density.

Abstract

Big graphs are so large that their analysis often relies on the cloud for parallel processing. Data placement, as a key pre-processing step, has a profound impact on the performance of parallel processing. Traditional placement methods fail to preserve graph topologies, leading to poor performance. As the community is the most common structure of big graphs, in this work we present a two-phase community-aware placement algorithm to place big graphs into the cloud for parallel processing. It obtains a placement scheme that preserves the community structure well by maximizing the modularity density of the scheme, under the memory capacity constraints of the computational nodes of the cloud, in two phases. In the first phase, we design a streaming partitioning heuristic to detect communities based on partial and incomplete graph information. These communities form an initial placement scheme with relatively high modularity density. To improve it further, in the second phase, we put forward a scale-constrained kernel k-means algorithm. It takes as input the initial placement scheme and iteratively redistributes graph vertices across computational nodes under scale constraints until the modularity density cannot be improved any further. Finally, experiments show that our algorithm preserves graph topologies well and greatly supports parallel processing of big graphs in the cloud.

© 2019 Elsevier B.V. All rights reserved.

1. Introduction

Along with innovations and advances in information technology, the scale of data generated from different domains such as the Web, social networks, and communication networks is increasing drastically [1]. Usually, such data have complex structures and can be naturally represented as graphs, with vertices denoting entities and edges denoting interactions between them. We call such data big graph data, or simply big graphs. Big graphs have big value, and there is a growing need for mining big

graphs. This field, known as big graph analytics, has gradually become the focus of intense research activity in both academia and industry [2]. For big graph analytics, the most direct and simple way is to use just a single computer [3,4]. However, due to the limited computational capability and memory capacity of a single machine, this way is only suitable for small-scale graphs. With the rapid growth of the scale of graphs, the cloud, consisting of powerful computational nodes, is becoming an ideal infrastructure for big graph analytics [5,6]. The cloud is in essence a platform for parallel and distributed computing [7]. Ahead of parallel processing, big graphs stored on a huge-capacity storage device should first be loaded and distributed to different computational nodes according to specific strategies. This process is known as data placement, the quality


of which seriously affects the performance of parallel processing in terms of data access locality, communication overhead, and load balance [8,9].

Among the existing placement strategies [8–17], random-based methods [9,10] place the vertices of a big graph randomly onto different computational nodes of the cloud. Chunk-based methods [11,12] partition the vertex set into contiguous chunks and then place them onto different computational nodes. While each node receives a roughly balanced amount of data, the data access locality is poor and the communication overhead is high during the later parallel processing stage. The reason is that these lightweight methods [9–12] seriously destroy graph topologies and result in too many cut edges spanning across computational nodes. Graph partitioning-based methods [8,13] partition a big graph into small subgraphs and distribute them onto different computational nodes. The number of cut edges spanning across computational nodes is small, bringing relatively low communication overhead during the later parallel processing stage. Clustering-based methods [14–17] can generate placement schemes with high cohesion and low coupling. The former indicates that the connections are dense within each subgraph, and the latter implies that the number of cut edges between subgraphs is small. However, the scales of the subgraphs on the computational nodes are not equal, leading to load imbalance during the later parallel processing stage. Meanwhile, these methods [14–17] have initialization problems [18].

A community, a subgraph with high cohesion and low coupling, is the most common and important structure of big graphs [19]. Together with the fact that data access locality exists in graph algorithms [20,21], placing big graphs onto different computational nodes based on communities is necessary and helpful. This strategy can significantly improve the performance of parallel processing of big graphs in the cloud regarding data access locality, communication overhead, and load balance [14]. In response to that, in this work we present a two-phase community-aware placement algorithm named scale_constrained kernel k-means++ (sckernel k-means++). In the first phase, a streaming partitioning heuristic named Maximal Increase of Modularity Density (MIMD) is designed to detect communities under scale constraints imposed by the memory capacities of the computational nodes. These communities are temporarily stored on the corresponding computational nodes, forming an initial placement scheme with relatively high modularity density. To improve it further, in the second phase, we present a scale-constrained kernel k-means algorithm. It takes as input the initial placement scheme and iteratively adjusts the distribution of graph vertices on the computational nodes under the scale constraints until the modularity density cannot be improved further.

The innovations and contributions of this work are as follows:
(1) We put forward the idea of community-aware data placement. It can preserve the structure of a big graph as much as possible when placing it in the cloud.
(2) We develop a two-phase strategy, making full use of the huge memory capacity and powerful computing capability of the cloud. The first phase temporarily stores a big graph onto computational nodes and accumulates its global information. Based on this, the second phase optimizes the initial placement scheme into a near-optimal one.
(3) We propose a two-phase community-aware data placement algorithm named sckernel k-means++ and prove its convergence. It can quickly obtain a high-quality solution for placing big graphs into the cloud.
(4) We present two optimizations of the proposed algorithm. Both can greatly accelerate its execution speed.

The remainder of this work is organized as follows: Section 2 reviews the related work on big graph placement strategies;

Section 3 formulates the big graph placement problem and introduces a traditional community detection algorithm named kernel k-means. Section 4 presents our two-phase community-aware data placement algorithm obtained by generalizing kernel k-means. It places a big graph onto the computational nodes of the cloud by detecting communities with the goal of maximizing the modularity density under scale constraints. Section 5 analyzes the convergence and complexity properties of the proposed algorithm and presents two optimizations. Section 6 conducts experiments to demonstrate its performance. Section 7 concludes this work with future directions.

2. Related work

Data placement is one of the most important pre-processing steps for big graph analytics. Many algorithms have been proposed, and they can be summarized as the following three kinds.

Graph partitioning. It mainly studies how to partition a graph into k smaller subgraphs while minimizing the total number of cut edges spanning across subgraphs. According to whether the scales of the subgraphs are required to be the same or not, it can be divided into balanced partitioning [22] and unbalanced partitioning [23,24]. When k > 2, both are NP-complete problems [25]. Thus, many heuristics for addressing these problems have been proposed, such as METIS [22] and H-load [26]. Recently, streaming graph partitioning methods [22,25–28] have emerged as a response to the incompetence of traditional algorithms at partitioning big graphs. These algorithms can be used to place big graphs into the cloud when k is equal to the number of computational nodes, with the scale constraints being the memory capacities of these nodes.

Big graph placement in the cloud. The current big graph placement methods can be roughly divided into three categories: random ones [9,10], chunk-based ones [11,12], and clustering-based ones [14–17]. The first [9,10] use a random function to determine onto which computational node of the cloud each vertex should be placed. They [9,10] pursue a balanced distribution of data in the cloud at the cost of seriously destroying the structure of a big graph. This strategy results in huge communication overhead and poor data access locality during the later parallel processing stage. Chunk-based methods [11,12] first split the vertex set into contiguous chunks and then place them onto computational nodes accordingly. They have similar advantages and disadvantages to the random ones. Clustering-based methods [14–17] first group a big graph into smaller subgraphs and then distribute them onto different computational nodes. Among them, Chen et al. [14] study the data placement problem in social networks with the aim of reducing communication overhead. The authors propose a modularity-driven clustering algorithm. It groups a big graph into smaller subgraphs with high cohesion and low coupling, and then places subgraphs with strong connectivity on computational nodes with a short distance between them. Vengadeswaran and Balasundaram [15] study the problem of big data placement in HDFS with the goal of improving data access locality. The authors first establish the dependency graph of data blocks. Then they use a Markov clustering algorithm to cluster the graph into smaller subgraphs and put together the data blocks of the same subgraph. The problem studied in [16] is similar to that in [14]; the difference is that the former uses the k-means algorithm instead. Leng et al. [17] study the big RDF data placement problem in the cloud with the aim of maximizing load balance and data access locality. The authors adopt a two-phase approach. The first phase uses a label propagation algorithm to coarsen the big RDF graph into a smaller one. The second phase uses the k-medoids clustering algorithm to cluster the coarse-grained graph to obtain the final placement scheme. The quality of this scheme is not good because of the coarse granularity of clustering. This


work, however, does inspire us to take a two-phase approach to address the community-aware big graph placement problem in the cloud.

Community detection of big graphs. Typical community detection methods include graph partitioning, hierarchical clustering, spectral clustering, partitional clustering, and optimization-based methods [29]. Kernel k-means [30], as a representative of partitional clustering methods, has the advantage of low computational cost. It first implicitly maps a graph embedded in a vector space to a high-dimensional feature space through a non-linear function, and then performs the traditional k-means algorithm on the mapped data in the feature space. However, it is sensitive to initializations and lacks constraints on the scales of communities. Thus, it cannot be directly applied to addressing the big graph placement problem in the cloud. Our two-phase community-aware data placement algorithm is based on the kernel k-means algorithm but differs from it in the following two aspects: first, we design a streaming partitioning heuristic, MIMD, to obtain an initial placement scheme with relatively high modularity density, which can greatly accelerate the convergence of the second phase; second, to ensure that the scales of the detected communities are no larger than the memory capacities of the corresponding computational nodes, we generalize kernel k-means by adding scale constraints. These two improvements allow our algorithm to quickly obtain a near-optimal placement scheme for clouds.

3. Community-aware big graph placement problem

3.1. Big graph and its traditional placement

A big graph refers to a large-scale dataset with a complex structure that is difficult to process on a single personal computer within a tolerable period of time. A big graph may be directed or undirected, weighted or unweighted, connected or disconnected. Without loss of generality, we assume that it is an undirected and unweighted graph, denoted as G = (V, E), where V = {v_1, v_2, ..., v_n} and E = {e_ij | v_i, v_j ∈ V} ⊆ V × V represent its vertex set and edge set, respectively. G's adjacency matrix is denoted as A = [a_1, a_2, ..., a_n], where the column vector a_i ∈ {0,1}^n denotes the adjacency relationship between v_i and the other vertices: ∀i, j ∈ [1, n], if v_i and v_j are adjacent, then a_ij = 1, and a_ij = 0 otherwise. G is stored as the adjacency matrix on a huge-capacity storage device. The storage it consumes is defined as the scale of G, denoted as scale(G). We assume that scale(G) = |V| = n, where |V| denotes the number of vertices of G.

Let the cloud P be made up of k computational nodes p_i (1 ≤ i ≤ k) connected by a high-speed network. The computing capability and memory capacity of p_i are r_i and s_i, respectively. Ahead of parallel processing, G should be loaded from the huge-capacity storage device and distributed to the computational nodes of P in advance. Traditional placement algorithms [8–10,13–17,21,22,25–28] divide G into k smaller subgraphs G_i = (V_i, E_i) and place G_i onto p_i, where scale(G_i) ≤ s_i (1 ≤ i ≤ k). π_k denotes the set of subgraphs. It is a k-way partitioning of G if ∀i, j ∈ [1, k], i ≠ j: V_i ∩ V_j = ∅, ⋃_{i=1}^{k} V_i = V, and ⋃_{i=1}^{k} E_i = E. E(G_i, G_j) denotes the set of cut edges spanning across G_i and G_j. \bar{G}_i = (\bar{V}_i, \bar{E}_i) denotes the complement graph of G_i with respect to G, where \bar{V}_i = V \ V_i and \bar{E}_i = E \ E_i. For convenience, we summarize the frequently used notations of this work in Table 1.

Table 1
Frequently used notations.
G / G_i / \bar{G}_i            A big graph / a subgraph / the complement graph of G_i
V / V_i / \bar{V}_i            The vertex set of G / the vertex set of G_i / the vertex set of \bar{G}_i
E / E_i / \bar{E}_i            The edge set of G / the edge set of G_i / the edge set of \bar{G}_i
A / K / D                      The adjacency matrix of G / a kernel matrix / the modularity density
P / p_i / r_i / s_i            The cloud P / a computational node / the computing capability of node p_i / the memory capacity of node p_i
n / k                          The number of vertices of G, i.e., the scale of G / the number of computational nodes in the cloud P
π_k / π̇_k                     The community partition of G / the optimal community partition of G
π_k^{2,t} / π̇_k^{2,∞}         The community partition of G at time t of Phase 2 / the optimal community partition of G at the end of Phase 2
C_i^{2,t} / Ċ_i^{2,t}          The status of community C_i at time t of Phase 2 / the optimal status of community C_i at time t of Phase 2

3.2. Concept of community

Definition 1 (Community). A community is a connected subgraph of a big graph. The vertices inside it are densely connected, while they are sparsely connected to the outside ones.

Fig. 1. A simple example of a community and its complement graph.

From Fig. 1, we can see that a community is a special subgraph with high cohesion and low coupling. To a certain degree, high cohesion indicates high data access locality, and low coupling indicates low communication overhead during the later parallel processing stage. To quantitatively measure the degree of cohesion and coupling of a community, this paper learns from the definition of modularity density in [31,32] and gives the following definition.

Definition 2 (Community Modularity Density). Suppose that C_i = (V_i, E_i) is a community of the big graph G; then its community modularity density is defined as:

$$D(C_i) = \theta(C_i) - \chi(C_i, \bar{C}_i) = \frac{|E_i|}{|V_i|} - \frac{|E(C_i, \bar{C}_i)|}{|V_i|} = \frac{|E_i| - |E(C_i, \bar{C}_i)|}{|V_i|}, \qquad (1)$$

where D(C_i), θ(C_i) and χ(C_i, \bar{C}_i) denote the modularity density, cohesion and coupling of C_i, respectively.

Assume that π_k = {C_1, C_2, ..., C_k} is the set of all communities of G. We name π_k a community partition if ∀i, j ∈ [1, k], i ≠ j: V_i ∩ V_j = ∅, ⋃_{i=1}^{k} V_i = V, and ⋃_{i=1}^{k} E_i = E. To quantitatively measure the overall cohesion and coupling of π_k, we introduce the definition of community partition modularity density and define it as the sum of the modularity densities of all communities. Formally, D(π_k) is the modularity density of π_k, and

$$D(\pi_k) = \sum_{C_i \in \pi_k} D(C_i) = \sum_{C_i \in \pi_k} \frac{|E_i| - |E(C_i, \bar{C}_i)|}{|V_i|}. \qquad (2)$$

We can see that the higher D(π_k) is, the higher the cohesion of all communities is and the lower the coupling among communities is. Thus, we select D(π_k) as a criterion to guide the big graph placement. The oncoming experiments show that it can effectively preserve the community structure, and thereby improve the data access locality and reduce the communication overhead in the later processing stage.

3.3. Problem formulation

To place a big graph into the cloud for parallel processing, we argue that preserving its structure is a prerequisite for achieving high-performance parallel processing. We expect to accomplish this goal by first detecting communities from it according to the number and memory capacities of the computational nodes of the cloud and then placing these communities onto the corresponding computational nodes. These communities form a placement scheme. It is a particular community partition of the big graph satisfying the scale constraints imposed by the computational nodes of the cloud. As shown in Fig. 2, the big graph G stored in a huge-capacity storage device outside the cloud is first loaded and then placed onto the k computational nodes of P. Each node p_i holds a community C_i with high cohesion and low coupling, and C_i's scale does not exceed the memory capacity of p_i.

Fig. 2. An example of the community-aware big graph placement in the cloud.

Different community partitions have different modularity densities, and the one with the highest value while satisfying the scale constraints can significantly improve the performance of parallel processing of the big graph in the cloud. The Community-aware Big Graph Placement Problem (CBGP) is to find such a community partition of the big graph. First, we assume that: (1) the computing capability and memory capacity of each node of cloud P are different, i.e., r_1 ≠ r_2 ≠ ... ≠ r_k and s_1 ≠ s_2 ≠ ... ≠ s_k; and (2) P can hold G, i.e., scale(G) = |V| ≤ Σ_{p_i ∈ P} s_i. Then the CBGP can be formulated as:

$$\begin{cases} \dot{\pi}_k = \operatorname*{argmax}_{\pi_k \in \Omega} \left( \sum_{C_i \in \pi_k} \frac{|E_i| - |E(C_i, \bar{C}_i)|}{|V_i|} \right), \\ 0 < |V_i| \le s_i, \quad \forall i \in [1, k], \\ U^T U = I, \end{cases} \qquad (3)$$

where U = [u_1, u_2, ..., u_k] is an n×k partition matrix and u_i denotes the vertex composition of community C_i; ∀j ∈ [1, n], i ∈ [1, k]: if v_j ∈ V_i, then u_ji = 1, and u_ji = 0 otherwise; I is a k×k identity matrix. Thus, the CBGP can be regarded as a community detection problem with scale constraints.
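To make the objective and constraints of Eqs. (1)–(3) concrete, the following sketch computes the modularity density D(π_k) of a candidate placement and checks the scale constraints. It is a minimal Python illustration, not the authors' implementation; the toy graph, assignment, and capacities are made up.

```python
from collections import defaultdict

def modularity_density(edges, assign):
    """D(pi_k) = sum_i (|E_i| - |E(C_i, C_i_bar)|) / |V_i|   (Eqs. (1)-(2))."""
    internal = defaultdict(int)   # |E_i|: edges with both endpoints in community i
    cut = defaultdict(int)        # |E(C_i, C_i_bar)|: edges leaving community i
    size = defaultdict(int)       # |V_i|
    for v, c in assign.items():
        size[c] += 1
    for u, v in edges:
        if assign[u] == assign[v]:
            internal[assign[u]] += 1
        else:
            cut[assign[u]] += 1
            cut[assign[v]] += 1
    return sum((internal[c] - cut[c]) / size[c] for c in size)

def satisfies_scale_constraints(assign, capacities):
    """0 < |V_i| <= s_i for every community i (constraint in Eq. (3))."""
    size = defaultdict(int)
    for c in assign.values():
        size[c] += 1
    return all(0 < size[i] <= capacities[i] for i in capacities)

# Toy example: two triangles joined by one cut edge, placed on two nodes.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
assign = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}     # vertex -> node/community index
capacities = {0: 4, 1: 4}                          # assumed s_i values
print(modularity_density(edges, assign))           # (3-1)/3 + (3-1)/3 = 1.333...
print(satisfies_scale_constraints(assign, capacities))   # True
```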

3.4. Community detection: kernel k-means method

This section briefly discusses a community detection method named kernel k-means [30], which is the base of the proposed placement algorithm. Kernel k-means extends the traditional k-means algorithm by using a kernel function and a non-linear mapping φ. Let the input data set be X = {x_1, x_2, ..., x_n}, where x_i represents the coordinate of the vertex v_i of G in a vector space (i ∈ [1, n]). π_k = {X_j}_{j=1}^{k} is a k-way partition of X. The objective function of kernel k-means, denoted as the Sum of Squared Error (SSE), is defined as:

$$SSE(\pi_k) = \sum_{X_i \in \pi_k} \sum_{j=1}^{n} W_{ji}\,\|\phi(x_j) - m_i\|^2, \qquad (4)$$

where m_i = (Σ_{x_j ∈ X_i} φ(x_j)) / |X_i| is the center of X_i, and W = [w_1, w_2, ..., w_k] is an n×k partition matrix whose definition is similar to that of U in Section 3.3. ||φ(x_j) − m_i||² can be converted to:

$$\|\phi(x_j) - m_i\|^2 = K_{jj} - \frac{2\sum_{x_l \in X_i} K_{jl}}{|X_i|} + \frac{\sum_{x_g \in X_i}\sum_{x_l \in X_i} K_{gl}}{|X_i|^2}, \qquad (5)$$

where K is the kernel matrix, and K_ij = φ(x_i)^T · φ(x_j) denotes the inner product of φ(x_i) and φ(x_j). Dhillon et al. [30] state that K can be any positive semi-definite matrix.

Theorem 1. If K = 2A − H and Z = Û, then min(SSE(π_k)) ⇔ max(D(π_k)), where the degree matrix H = [h_1, h_2, ..., h_n] is a diagonal matrix with H_ii = Σ_{j=1}^{n} a_ij, i ∈ [1, n].

Proof.
$$SSE(\pi_k) = \sum_{X_i \in \pi_k} \sum_{j=1}^{n} W_{ji}\|\phi(x_j) - m_i\|^2 = \sum_{X_i \in \pi_k} \left\| \Phi_i \left(I - \frac{ee^T}{|X_i|}\right) \right\|^2,$$
where Φ_i = [φ(x_{i1}), φ(x_{i2}), ..., φ(x_{i|X_i|})] and e is an all-one column vector of the corresponding dimension. Let tr(A) denote the trace of matrix A. According to tr(AA^T) = tr(A^T A) = ||A||_F^2,
$$SSE(\pi_k) = \sum_{X_i \in \pi_k} tr\!\left(\Phi_i \left(I - \frac{ee^T}{|X_i|}\right)\left(I - \frac{ee^T}{|X_i|}\right)^{\!T} \Phi_i^T\right) = tr(\Phi^T \Phi) - tr(Z^T \Phi^T \Phi Z),$$
where
$$Z = \begin{bmatrix} z_1 & & \\ & \ddots & \\ & & z_k \end{bmatrix}, \qquad z_i = \frac{e}{|X_i|^{1/2}}, \; i \in [1, k].$$
As Φ^T Φ = K and tr(Φ^T Φ) is a constant, min(SSE(π_k)) ⇔ max(tr(Z^T K Z)). On the other hand,
$$D(\pi_k) = \sum_{C_i \in \pi_k} \frac{|E_i| - |E(C_i, \bar{C}_i)|}{|V_i|} = \sum_{u_i \in U} \frac{u_i^T A u_i - (u_i^T H u_i - u_i^T A u_i)}{u_i^T u_i} = \sum_{u_i \in U} \frac{u_i^T (2A - H) u_i}{u_i^T u_i}.$$
Normalize u_i with respect to (u_i^T u_i)^{1/2}, i.e., û_i = u_i / (u_i^T u_i)^{1/2}. Then
$$D(\pi_k) = \sum_{u_i \in U} \hat{u}_i^T (2A - H)\hat{u}_i = tr\!\left(\hat{U}^T (2A - H)\hat{U}\right),$$
where Û = [û_1, û_2, ..., û_k]. Thus, max(D(π_k)) ⇔ max(tr(Û^T(2A − H)Û)) ⇔ max(tr(Z^T K Z)) ⇔ min(SSE(π_k)). □

From Theorem 1, we can see that, on the one hand, a k-way community partition with the smallest SSE(π_k) corresponds to the one with the highest modularity density. On the other hand, the community partition obtained may not be a placement scheme if the scales of the communities are too large to place onto the computational nodes of the cloud. In addition, the kernel k-means method is sensitive to initializations [33]. Thus, it cannot be directly applied to addressing the CBGP. We will improve it to address the CBGP next.
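The kernel used in the second phase can be made explicit in a few lines. The sketch below is an illustrative Python check, not the paper's Java/Giraph implementation; the small graph is made up. It builds K = 2A − H from an adjacency matrix and, following the shift K = 2A − H + σI used later in Section 6.3, adds σI to make the kernel positive semi-definite.

```python
import numpy as np

def build_kernel(adj, sigma=None):
    """Kernel matrix K = 2A - H (+ sigma*I), with H the diagonal degree matrix."""
    A = np.asarray(adj, dtype=float)
    H = np.diag(A.sum(axis=1))          # H_ii = sum_j a_ij
    K = 2.0 * A - H
    if sigma is None:
        # One possible choice: the smallest shift that makes K positive semi-definite.
        sigma = max(0.0, -np.linalg.eigvalsh(K).min())
    return K + sigma * np.eye(A.shape[0])

# Toy graph: two triangles joined by a single edge (same example as before).
A = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1.0
K = build_kernel(A)
print(np.all(np.linalg.eigvalsh(K) >= -1e-9))    # True: K is positive semi-definite
```

Since Z^T Z = I, tr(Z^T (K + σI) Z) = tr(Z^T K Z) + σk, so the diagonal shift only adds a constant to the objective and does not change which partition is optimal.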


4. Two-phase community-aware placement algorithm

4.1. Basic idea

The basic idea of the sckernel k-means++ algorithm is to obtain a placement scheme for the big graph G to be processed in parallel in the cloud P by detecting a community partition with scale constraints in two phases. The first phase runs MIMD to load G from the huge-capacity storage device outside and partition G into smaller communities in a streaming manner, with the aim of maximizing the modularity density according to the number and memory capacities of the computational nodes of P. It obtains an initial placement scheme with relatively high modularity density. The second phase runs the sckernel k-means algorithm to iteratively adjust the vertex distribution of the initial placement scheme under the memory capacity constraints of the computational nodes until the modularity density cannot be improved any further. Then a near-optimal placement scheme is obtained.

Fig. 3 shows the process of our two-phase community-aware placement algorithm. Initially, the big graph G is stored on the huge-capacity storage device outside the cloud. A placer first runs the MIMD heuristic to partition G, passed in by the loader, into k smaller communities in a streaming manner. They are temporarily stored on the corresponding computational nodes of P. Then the placer runs the sckernel k-means algorithm to iteratively adjust the initial placement scheme until a near-optimal one is obtained. To avoid duplication, G and the high-speed interconnection network are omitted in Fig. 3.

Fig. 3. The process of the proposed placement algorithm.

4.2. MIMD to obtain the initial placement scheme

The main task of the first phase is to obtain an initial placement scheme with relatively high modularity density. However, the large scale and complex structure of the big graph G pose big challenges. Considering the need to load G from the huge-capacity storage device and the advantages of streaming partitioning methods, such as being single-pass and lightweight, we design a streaming partitioning heuristic named MIMD. Its main idea is to place each vertex passed in by the loader onto the computational node of P with the maximum improvement of modularity density under the memory capacity constraint of this node. Ties are broken randomly. These constraints ensure that the obtained community partition satisfies the definition of a data placement scheme. Let π_k^{1,t} = {C_1^{1,t}, C_2^{1,t}, ..., C_k^{1,t}} denote the community partition at time t in the first phase, and C_i^{1,t} represent the status of C_i on the computational node p_i at time t (i ∈ [1, k]). Then, for the vertex v_j arriving at time t+1, the index ind(v_j) of the computational node where v_j should be placed can be determined by the following equation:

$$ind(v_j) = \operatorname*{argmax}_{i \in [1,k],\; s_i \ge |V_i^{1,t}|+1} \left( D\!\left(\pi_k^{1,t} \oplus (C_i^{1,t} \leftarrow v_j)\right) - D\!\left(\pi_k^{1,t}\right) \right), \qquad (6)$$

where C_i^{1,t} ← v_j denotes that v_j is assigned to C_i^{1,t} at time t+1, which changes to C_i^{1,t+1} afterwards; π_k^{t+1} = π_k^{1,t} ⊕ (C_i^{1,t} ← v_j) = {C_1^{1,t}, C_2^{1,t}, ..., C_i^{1,t} ← v_j, ..., C_k^{1,t}} represents that π_k^{t+1} is obtained by changing C_i^{1,t} to C_i^{1,t+1} while the others are unchanged. When all vertices are loaded, i.e., t = |V|, π_k^{1,|V|} = {C_1^{1,|V|}, C_2^{1,|V|}, ..., C_k^{1,|V|}} is the initial placement scheme with relatively high modularity density.
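A compact way to read Eq. (6) is as a per-vertex greedy rule over the incremental change in modularity density. The sketch below is illustrative Python, not the authors' implementation; it assumes the stream delivers each vertex together with its already-seen neighbors, which is one common streaming-partitioning setup, and it keeps per-node counters so the gain of each candidate node can be evaluated from a handful of values.

```python
from collections import defaultdict

class MIMDStream:
    """Streaming heuristic sketch in the spirit of Eq. (6): place each arriving
    vertex on the node whose community gains the most modularity density,
    subject to that node's remaining memory capacity."""

    def __init__(self, capacities):
        self.cap = list(capacities)          # s_i per node
        self.k = len(capacities)
        self.size = [0] * self.k             # |V_i|
        self.internal = [0] * self.k         # |E_i|
        self.cut = [0] * self.k              # |E(C_i, C_i_bar)| among placed vertices
        self.assign = {}                     # vertex -> node index

    def _gain(self, i, nbr):
        """Change in D(pi_k) if the new vertex joins community i."""
        d_i = nbr.get(i, 0)
        d_out = sum(d for c, d in nbr.items() if c != i)
        old_i = (self.internal[i] - self.cut[i]) / self.size[i] if self.size[i] else 0.0
        new_i = (self.internal[i] + d_i - (self.cut[i] + d_out)) / (self.size[i] + 1)
        # Every edge to another community adds one cut edge to that community too.
        loss = sum(d / self.size[c] for c, d in nbr.items() if c != i)
        return (new_i - old_i) - loss

    def place(self, v, seen_neighbors):
        nbr = defaultdict(int)
        for u in seen_neighbors:
            if u in self.assign:
                nbr[self.assign[u]] += 1
        # Assumes total capacity is not exhausted (cf. assumption (2) in Section 3.3).
        feasible = [i for i in range(self.k) if self.size[i] + 1 <= self.cap[i]]
        best = max(feasible, key=lambda i: self._gain(i, nbr))
        self.size[best] += 1
        self.internal[best] += nbr.get(best, 0)
        for c, d in nbr.items():
            if c != best:
                self.cut[best] += d
                self.cut[c] += d
        self.assign[v] = best
        return best
```

In this sketch ties are broken by the first feasible node; the paper breaks them randomly.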

4.3. Sckernel k-means to obtain the near-optimal placement scheme

The initial data placement scheme is good, but not good enough regarding the modularity density. To improve it further, in the second phase, we present an algorithm named sckernel k-means. It is a generalization of the traditional kernel k-means obtained by adding scale constraints. Given the input π_k^{1,|V|}, it iteratively executes the following two steps until the modularity density no longer changes: (1) fix the center of each community and assign each vertex to the community to which its nearest center belongs under the scale constraints; (2) update all community centers. Note that community centers cannot be calculated explicitly because the function φ is unknown. Of these two steps, the first is the key. Essentially, it is a vertex assignment problem. To address it, we propose a greedy assignment strategy named GReedy Assignment with Scale ConStraints (GRASS). Its basic idea is to first logically assign each vertex to the computational nodes of the cloud according to the principle of the shortest distance, ignoring the scale constraints. Then, it adjusts the communities whose scales exceed the memory capacities of the computational nodes where they are placed. This adjustment is done by redistributing the excessive vertices of these communities to computational nodes with residual memory capacities, in the non-increasing order of the gain of the best over the last assignment. The pseudo-code of the sckernel k-means and GRASS is shown in Algorithms 1 and 2, respectively.
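The assignment step of GRASS can be sketched as two passes: an unconstrained nearest-center assignment followed by a repair pass for overloaded nodes. The code below is one illustrative Python outline of this idea under simplifying assumptions, not the pseudo-code of Algorithm 2; a precomputed vertex-to-center distance matrix `dist` of shape n×k is assumed (in the paper these distances come from the kernel expansion of Eq. (5)).

```python
import numpy as np

def grass_assign(dist, capacities):
    """Greedy assignment with scale constraints (sketch of the GRASS idea).

    dist[j, i] : distance of vertex j to community center i (assumed given)
    capacities : s_i, the maximum number of vertices node i may hold
    returns    : array with one community index per vertex
    """
    n, k = dist.shape
    assert sum(capacities) >= n, "total capacity must cover all vertices"
    assign = dist.argmin(axis=1)                    # pass 1: ignore capacities
    load = np.bincount(assign, minlength=k)

    # pass 2: repair overloaded communities.
    for i in range(k):
        while load[i] > capacities[i]:
            members = np.where(assign == i)[0]
            best_alt = np.full(n, -1)
            penalty = np.full(n, np.inf)
            for j in members:
                feasible = [c for c in range(k) if c != i and load[c] < capacities[c]]
                if feasible:
                    c = min(feasible, key=lambda c: dist[j, c])
                    best_alt[j], penalty[j] = c, dist[j, c] - dist[j, i]
            # Move the vertex that loses the least by leaving i
            # (non-increasing gain of best over last assignment).
            j = members[np.argmin(penalty[members])]
            assign[j] = best_alt[j]
            load[i] -= 1
            load[best_alt[j]] += 1
    return assign

# Tiny usage example with made-up distances: 4 vertices, 2 nodes of capacity 2.
dist = np.array([[0.1, 0.9], [0.2, 0.8], [0.3, 0.7], [0.4, 0.6]])
print(grass_assign(dist, capacities=[2, 2]))    # e.g. [0 0 1 1]
```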

5. Algorithm analysis and optimization

In this section, we theoretically analyze the complexity and convergence properties of the sckernel k-means algorithm. To reduce its complexity, we give a pruning strategy based on the principle of the triangle inequality. It can reduce the number of unnecessary distance calculations between vertices and community centers. Besides, we present a data-parallel strategy to parallelize the sckernel k-means algorithm. It can greatly speed up the process of big graph data placement.

5.1. Theoretical analysis

Theorem 2. The sckernel k-means algorithm terminates in a finite number of iterations and converges to a locally optimal solution. This solution cannot be improved regarding modularity density by moving a vertex to a different community without violating the memory capacity constraints.

Proof. At each iteration, the assignment step cannot increase the value of the objective function SSE(π_k^{2,t}); the community membership update step either reduces the value of SSE(π_k^{2,t}) or terminates. Thus, SSE(π_k^{2,t}) is a non-increasing function of the iteration t. As the lower bound of SSE(π_k^{2,t}) is zero and the number of community partitions of G is finite [34], the algorithm terminates in a finite number of iterations. Since SSE(π_k^{2,t}) is a non-convex function, it converges to a locally optimal solution when the algorithm terminates, without a guarantee that this solution is globally optimal. □

Corollary. When the sckernel k-means algorithm terminates, each vertex of G is placed onto the computational node where the community to which its nearest community center belongs is located.

Proof. It can be easily derived from Theorem 2. □

Theorem 3. If the sckernel k-means algorithm terminates in t_max iterations, then its time complexity is ((2 + 1/k)·n² + k)·t_max·τ_flop, where n denotes the number of vertices of the graph G and τ_flop the time required for performing a floating-point operation on a standard computational node p.

Proof. Assume that every addition, multiplication, division, or comparison is counted as one floating-point operation (flop). In each iteration, the distance calculations for all pairs of vertices require n² flops, the community center recalculations require n² + k flops, and the distance sort requires roughly k·(n/k)² = n²/k flops. Putting them together, we can conclude that the time complexity T_1 of the proposed algorithm is:

$$T_1 = ((2 + 1/k)\cdot n^2 + k)\cdot t_{max}\,\tau_{flop}. \qquad (7)$$
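As a quick sanity check of Eq. (7) (and, later, of the parallel cost model in Theorem 4), the formula can be evaluated directly; the numbers below are made-up illustrative parameters, not measurements from the paper.

```python
def t_sequential(n, k, t_max, tau_flop):
    """T1 from Eq. (7): ((2 + 1/k) * n^2 + k) * t_max * tau_flop."""
    return ((2 + 1 / k) * n ** 2 + k) * t_max * tau_flop

# Example: n = 1e6 vertices, k = 14 nodes, 20 iterations, 1 ns per flop (assumed).
print(t_sequential(n=1_000_000, k=14, t_max=20, tau_flop=1e-9))   # seconds
```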

5.2. Algorithm optimization

Optimization 1: Reduce unnecessary distance calculations. As the graph scale continues to increase, the execution time of our algorithm increases rapidly, leading to poor scalability. To reduce the complexity and improve the scalability of our algorithm, in this section we propose a pruning strategy based on the principle of the triangle inequality [35]. It can reduce the number of unnecessary distance calculations. The basic idea is that the distance calculation for any vertex v_j ∈ V and any new community center m_h^{2,t+1} is necessary only when the lower bound of this distance is smaller than the distance between v_j and its current community center m_i^{2,t}. According to the triangle inequality [35], ∀v_j ∈ C_i^{2,t}, C_h^{2,t+1} ∈ π_k^{2,t+1}, dist(m_h^{2,t+1}, v_j) satisfies the inequality below:

$$dist(m_h^{2,t}, v_j) - dist(m_h^{2,t+1}, m_h^{2,t}) \;\le\; dist(m_h^{2,t+1}, v_j) \;\le\; dist(m_h^{2,t}, v_j) + dist(m_h^{2,t+1}, m_h^{2,t}). \qquad (8)$$

Let \underline{dist}(m_h^{2,t+1}, v_j) denote the lower bound of dist(m_h^{2,t+1}, v_j); then \underline{dist}(m_h^{2,t+1}, v_j) = dist(m_h^{2,t}, v_j) − dist(m_h^{2,t+1}, m_h^{2,t}). Only when \underline{dist}(m_h^{2,t+1}, v_j) is smaller than the distance between v_j and its current community center does dist(m_h^{2,t+1}, v_j) need to be computed exactly. To this end, we maintain a matrix L^t, where L^t_{hj} records the lower bound \underline{dist}(m_h^{2,t+1}, v_j) when the exact computation is pruned and L^t_{hj} = dist(m_h^{2,t+1}, v_j) otherwise. In addition, we also maintain a k×k diagonal matrix Q^t, where ∀h ∈ [1, k], Q^t_{hh} = dist(m_h^{2,t+1}, m_h^{2,t}). In the following section, we will see that this strategy can significantly reduce the computational overhead and thus increase the scalability of our algorithm.

Optimization 2: Parallelization. The sckernel k-means algorithm is centralized, so it will face scalability issues as the data scale continues to increase rapidly. As is well known, algorithms like k-means are inherently data-parallel and thus easy to parallelize. Taking advantage of this merit, we parallelize the centralized algorithm and present its parallel version, named parsckernel k-means. It is based on the SPMD (Single Program Multiple Data) model and uses MPI (Message Passing Interface), a widely used communication protocol for programming parallel and distributed computers. Observe that the distance calculations in lines 7∼9 of Algorithm 1 and the assignment of vertices to communities in lines 2∼16 of Algorithm 2 are data-parallel; they can be executed in parallel and asynchronously for each vertex. Besides, these lines dominate the execution time. Thus, we use a simple yet effective data-parallel strategy to parallelize these two hotspots in the parsckernel k-means algorithm. Its pseudo-code is shown in Algorithms 3 and 4. Only the differences between the parsckernel k-means and the sckernel k-means are shown to avoid redundancy.
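The triangle-inequality pruning of Optimization 1 can be expressed in a few lines. The sketch below is an illustrative Python fragment, not the authors' code: `L[h][j]` caches the last exact distance or lower bound for center h and vertex j, `Q[h]` is the center drift dist(m_h^{2,t+1}, m_h^{2,t}) from Eq. (8), and `exact_dist` is assumed to be the kernel expansion of Eq. (5).

```python
def pruned_nearest_center(j, current, k, L, Q, exact_dist):
    """Nearest new center for vertex j, using Eq. (8) lower bounds to skip
    exact kernel-distance evaluations (Optimization 1, sketched)."""
    # Always evaluate the vertex's current community exactly first.
    best_h = current
    best_d = exact_dist(current, j)
    L[current][j] = best_d
    for h in range(k):
        if h == current:
            continue
        lower = L[h][j] - Q[h]           # lower bound on dist(m_h^{2,t+1}, v_j)
        if lower >= best_d:              # cannot be closer than the best so far
            L[h][j] = lower              # keep the bound; skip the exact distance
            continue
        d = exact_dist(h, j)             # e.g. evaluated via Eq. (5)
        L[h][j] = d                      # cache the exact distance for the next round
        if d < best_d:
            best_h, best_d = h, d
    return best_h
```

Because any cached value never exceeds the true old distance, subtracting the drift always yields a valid lower bound, so the pruning never changes the result, only the amount of work.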

Theorem 4. The time complexity of the parsckernel k-means algorithm run on the cloud P is ((2n² + k)·t_max·τ_flop)/((Σ_{p_i∈P} r_i)/r) + n·k·t_max·τ_comm, where τ_comm denotes the time required to transfer a floating-point number from one computational node of P to another.

Proof. From Theorem 3, we know that the time complexity of the sckernel k-means algorithm is ((2 + 1/k)·n² + k)·t_max·τ_flop. As both the distance calculations and the community center recalculations are data-parallel, all these operations can run in parallel on the k computational nodes of the cloud P. Let the computation, communication, and total time complexity of the parsckernel k-means algorithm be denoted as T_k^{comp}, T_k^{comm}, and T_k, respectively. Then T_k^{comp} can be calculated as T_k^{comp} = T_1/k' = (2n² + k)·t_max·τ_flop/((Σ_{p_i∈P} r_i)/r), where r is the computing capability of a standard computational node p and k' is the normalized number of computational nodes. T_k^{comm} = n·k·t_max·τ_comm. Thus, we have T_k = T_k^{comp} + T_k^{comm} = ((2n² + k)·t_max·τ_flop)/((Σ_{p_i∈P} r_i)/r) + n·k·t_max·τ_comm. □

Theorem 5. Let the relative speedup, efficiency, and scaleup of the parsckernel k-means algorithm be sp, ε, and ψ, respectively. Then we have the conclusions given in Box I, where the relative speedup is measured by keeping the problem size fixed and the relative scaleup is measured by keeping the problem size per computational node fixed while increasing the number of nodes.

$$sp = \frac{T_1}{T_k} = \frac{((2+1/k)\cdot n^2+k)\cdot t_{max}\cdot\tau_{flop}}{\dfrac{(2n^2+k)\cdot t_{max}\cdot\tau_{flop}}{(\sum_{p_i\in P} r_i)/r}+n\cdot k\cdot t_{max}\cdot\tau_{comm}} = \frac{((2+1/k)\cdot n^2+k)\cdot\tau_{flop}}{\dfrac{(2n^2+k)\cdot\tau_{flop}}{(\sum_{p_i\in P} r_i)/r}+n\cdot k\cdot\tau_{comm}}, \qquad (9)$$

$$\varepsilon = \frac{sp}{k'} = \frac{((2+1/k)\cdot n^2+k)\cdot\tau_{flop}}{\left(\dfrac{(2n^2+k)\cdot\tau_{flop}}{(\sum_{p_i\in P} r_i)/r}+n\cdot k\cdot\tau_{comm}\right)\cdot\dfrac{\sum_{p_i\in P} r_i}{r}} = \frac{((2+1/k)\cdot n^2+k)\cdot\tau_{flop}}{(2n^2+k)\cdot\tau_{flop}+n\cdot k\cdot\tau_{comm}\cdot\dfrac{\sum_{p_i\in P} r_i}{r}}, \qquad (10)$$

$$\psi = \frac{((2+1/k)\cdot n^2+k)\cdot\tau_{flop}}{\left(2\cdot\dfrac{\sum_{p_i\in P} r_i}{r}\cdot n^2 + \dfrac{rk}{\sum_{p_i\in P} r_i}\right)\cdot\tau_{flop} + \dfrac{\sum_{p_i\in P} r_i}{r}\cdot n\cdot k\cdot\tau_{comm}}. \qquad (11)$$

Box I.

Proof. Eqs. (9) and (10) can be directly derived from Theorems 3 and 4. Eq. (11) can be obtained by replacing n with ((Σ_{p_i∈P} r_i)/r)·n in (9) to increase the problem size in terms of the graph scale. □
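Eqs. (9) and (10) are easy to evaluate numerically to see how the communication term limits the achievable speedup; the parameters below are assumed values for illustration only, not measurements from Section 6.

```python
def speedup(n, k, R_over_r, tau_flop, tau_comm):
    """Relative speedup sp from Eq. (9); R_over_r = (sum_i r_i) / r."""
    t1 = ((2 + 1 / k) * n ** 2 + k) * tau_flop
    tk = ((2 * n ** 2 + k) * tau_flop) / R_over_r + n * k * tau_comm
    return t1 / tk

def efficiency(n, k, R_over_r, tau_flop, tau_comm):
    """Relative efficiency from Eq. (10): sp / k'."""
    return speedup(n, k, R_over_r, tau_flop, tau_comm) / R_over_r

# Assumed parameters: 1e6 vertices, 14 nodes with aggregate capability 14x a
# standard node, 1 ns per flop, 100 ns per transferred floating-point number.
print(speedup(1_000_000, 14, 14.0, 1e-9, 1e-7))
print(efficiency(1_000_000, 14, 14.0, 1e-9, 1e-7))
```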

6. Evaluation results

We conduct five experiments to evaluate the performance of the proposed algorithm in addressing the CBGP. First, the experimental setup, including the datasets, the graph workload, our cloud, and the graph analytics platform, is presented. Then, ten evaluation metrics from three categories are designed to evaluate the algorithm. At last, the experiments are carried out and their results are discussed.

6.1. Experiment setup

Graph datasets. We collect three big graphs from multiple sources and store them on our huge-capacity disk array. The first two are webs downloaded from the Laboratory for Web Algorithms website [36]. They are collected by using the WebGraph framework [37] and LLP [38]. The last one is a big synthetic graph generated by the R-MAT tool [39]. See Table 2 for more details.

Graph workload. We select one typical local graph query and two representative graph algorithms to evaluate the ability of the proposed placement algorithm to support the parallel processing of big graphs: (1) h-hop neighbor search query. It takes as input the big graph G, a starting vertex v, and the number of hops h, does a breadth-first search from v, and returns all the vertices within h hops of v. This query is widely used in friend recommendation, information retrieval, and machine learning. (2) PageRank. It takes as input G and an optional maximal iteration number to iteratively measure the importance of web pages. This algorithm can also be used to rank many other objects, such as football players and social network users. (3) SSSP. It takes as input G and a source vertex and finds the shortest path between it and every other vertex. This algorithm has been extensively used in navigation, route planning, and location-based services.

Cloud. We use OpenStack as the Infrastructure-as-a-Service software to deploy our private cloud P on a cluster of hardware machines including a T.t Desktop server, a 4-node PowerEdge 6850 server, a 4-node Transwarp TxData-4 server, and a Wuzhou S920G2 5-blade server. They are connected through Gigabit Ethernet. For convenience, they are numbered from 1 to 14. See Table 3 for their detailed configurations. The deployed cloud is the Pike version of OpenStack with bare metal provisioning enabled. By leveraging this technique, it can provide high-performance computing clusters while retaining the auto-scaling and ease of management of a traditional cloud. In addition, there is a Sun StorageTek 5320 NAS system with 60 TB storage capacity. It is used for storing the big graph datasets and is connected to P with fiber channel cables that support a transfer rate of up to 5 Gbps.

Graph analytics platform. We select Apache Giraph 1.1.0 as our graph analytics platform, which is an open-source implementation of the proprietary Pregel [5] of Google. Giraph is a state-of-the-art distributed graph processing system based on the BSP programming model [40]. We set up a cluster on the OpenStack bare metal cloud P to deploy Giraph to run local graph queries and graph algorithms. Note that Giraph's default chunk-based placement algorithm is modified to adapt to the heterogeneous setting.
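For reference, the h-hop neighbor search used as the query workload amounts to a depth-limited breadth-first search. A minimal single-machine Python version is sketched below; the distributed, message-based Giraph job used in the experiments is not shown, and the adjacency-list input format here is an assumption.

```python
from collections import deque

def h_hop_neighbors(adj, start, h):
    """Return all vertices within h hops of `start` in the graph `adj`
    (adjacency lists: vertex -> iterable of neighbors)."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if seen[u] == h:
            continue                      # do not expand beyond h hops
        for v in adj.get(u, ()):
            if v not in seen:
                seen[v] = seen[u] + 1
                queue.append(v)
    seen.pop(start)
    return set(seen)

adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1, 4], 4: [3]}
print(h_hop_neighbors(adj, start=0, h=2))    # {1, 2, 3}
```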

We implement the proposed placement algorithm and two other state-of-the-art methods, BRGP [17] and Combined (CB) [6], in Java and integrate them into Giraph to evaluate and compare their performance. The former is a two-phase data placement algorithm for big graphs. It uses a label propagation algorithm to coarsen a big graph into a smaller one and then uses the k-medoids clustering algorithm to cluster the coarse-grained graph to obtain the final placement scheme. CB [6] is a heterogeneity-aware streaming graph partitioning heuristic. It divides the graph vertex set into a high-degree set and a low-degree set, and assigns these two sets of vertices in a streaming manner using different strategies.

6.2. Evaluation metrics

We evaluate the proposed algorithm from the following three aspects: placing quality, placing performance, and the performance of the parallel processing of local graph queries and graph algorithms. The first includes the modularity density D(π_k) [31], the edge cut ratio λ_d [23], and the balance of data placement ρ_d [23]. The second consists of the runtime τ_p needed to obtain a near-optimal placement scheme and the convergence speed, the relative speedup sp [41], the relative efficiency ε [41], and the relative scalability ψ [41]. The last one includes the throughput (queries only), the turnaround time τ_t (graph algorithms only), the communication overhead, and the load balance ρ_w. The quantitative definitions of the above metrics are as follows: (1) λ_d = |⋃_{i=1}^{k} E(C_i, \bar{C}_i)| / |E|; (2) ρ_d = max_{i∈[1,k]} (scale(C_i^{2,∞}) / scale_E(C_i^{2,∞})), where scale_E(C_i^{2,∞}) = (s_i · scale(G)) / Σ_{j∈[1,k]} s_j denotes the expected scale of C_i^{2,∞} that should be placed onto p_i in a perfectly balanced data placement scheme; (3) τ_p = τ_p1 + τ_p2, where τ_p1 and τ_p2 denote the time spent in the first and second phases, respectively; (4) sp = τ_p / τ_m, where τ_m denotes the runtime of the parallel version of the proposed algorithm on m (1 ≤ m ≤ k) computational nodes; (5) ε = sp / m; (6) ψ is defined as the ratio of the execution time of the proposed algorithm for placing a big graph G onto one computational node p to the execution time of the parallel version of the proposed algorithm for placing another big graph G' with scale ((Σ_{p_i∈P} r_i)/r)·scale(G) onto an m-computational-node cluster; (7) the convergence speed describes the rate of change of D(π_k) with respect to τ_p; (8) the throughput denotes the number of query results returned per second; (9) the communication overhead refers to the number of cross-server messages generated per second; (10) ρ_w = max_{i∈[1,k]} (load(p_i) / load_E(p_i)), where load(p_i) and load_E(p_i) denote the actual and expected amount of load assigned to p_i, respectively. For graph queries, they are calculated as the real and expected numbers of queries assigned. For graph algorithms, they are calculated according to the approach proposed in [20].
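Two of the placement-quality metrics are straightforward to compute from a placement. The sketch below is illustrative Python, reusing the edge-list and assignment representation from the earlier sketch, with made-up capacities; it evaluates the edge cut ratio λ_d and the data-placement balance ρ_d as defined above.

```python
def edge_cut_ratio(edges, assign):
    """lambda_d: fraction of edges whose endpoints land on different nodes."""
    cut = sum(1 for u, v in edges if assign[u] != assign[v])
    return cut / len(edges)

def placement_balance(assign, capacities, total_vertices):
    """rho_d: worst-case ratio of actual to expected community scale."""
    total_cap = sum(capacities.values())
    size = {i: 0 for i in capacities}
    for c in assign.values():
        size[c] += 1
    expected = {i: capacities[i] * total_vertices / total_cap for i in capacities}
    return max(size[i] / expected[i] for i in capacities)

edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
assign = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
print(edge_cut_ratio(edges, assign))                          # 1/7 = 0.142...
print(placement_balance(assign, {0: 4, 1: 4}, len(assign)))   # 1.0
```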


Table 2
Graph datasets summary.

Name          Vertices        Edges             Avg. degree   Type        Source
uk-2007-05    105,896,555     3,738,733,648     71            Web         [37,38]
clueweb12     978,408,098     42,574,107,469    87            Web         [37,38]
RMAT          761,927,198     38,096,359,915    100           Synthetic   [39]

Table 3
Configurations of our cloud P.

#Node        CPU                                            Memory   Disk      NIC      OS                Compiler
Node 1       AMD Phenom(tm) II X6 1100T                     16 G     1T HDD    1 Gbps   centos-7-3.1611   gcc-4.8.5
Node 2∼5     Intel(R) Xeon(TM) dual-core [email protected] GHz    32 G     1T HDD    1 Gbps   centos-7-3.1611   gcc-4.8.5
Node 6∼9     Intel(R) Xeon(R) E5-2620 [email protected] GHz        128 G    6T HDD    1 Gbps   centos-7-3.1611   gcc-4.8.5
Node 10∼14   Intel(R) Xeon(R) E5-2640 [email protected] GHz        128 G    1T HDD    1 Gbps   centos-7-3.1611   gcc-4.8.5

6.3. Experiments and analysis

We design five experiments in this section. The first four evaluate the placing quality, the placing performance, the effect of our pruning strategy, and the effect of our parallelization optimization, respectively. The last one evaluates the ability of the proposed algorithm to support parallel processing of the local graph query and the graph algorithms on three big graphs. In each of these experiments, we set K = 2A − H + σI instead of K = 2A − H, where I is an identity matrix of the corresponding dimension and the integer σ is big enough to ensure that K is positive semi-definite. For the three big graphs, we set σ = 3392831, 131216732, and 165826132, respectively, according to the method presented in [42]. All these experiments are conducted five times and the average results are reported.

Experiment 1: It is designed to evaluate the quality of the placement schemes obtained by the proposed algorithm for the three big graphs in the cloud P. Evaluation metrics include the modularity density D(π_k), the edge cut ratio λ_d, and the balance of data placement ρ_d. We select BRGP, MIMD, CB, and Chunk as comparisons. Experimental results are shown in Fig. 4.

From Fig. 4, we can see that our algorithm performs the best on the metrics D(π_k) and λ_d and the worst on the metric ρ_d in placing all three big graphs compared with the other methods. This finding implies that D(π_k) has a positive correlation with λ_d and an inverse correlation with ρ_d. Taking the dataset RMAT as an example, the D(π_k) of the placement scheme obtained by our algorithm is 12%, 26%, 47%, and 273% higher than that by BRGP, MIMD, CB, and Chunk, respectively; λ_d by the proposed algorithm is 12%, 25%, 41%, and 79% lower than that by BRGP, MIMD, CB, and Chunk, respectively; ρ_d by the proposed algorithm is 15%, 10%, 8%, and 5% higher than that by BRGP, MIMD, CB, and Chunk, respectively. The reason why sckernel k-means++ performs the best on the metric D(π_k) is that it aims at maximizing D(π_k) in the two successive phases, while CB only pursues high cohesion of each community and Chunk does not optimize D(π_k) at all. The larger D(π_k) is, the higher the connection density within each community is and the lower the connection density across communities is. Thus, it is not surprising that ours performs the best on the metric λ_d. But this does not come at no cost: the proposed algorithm sacrifices ρ_d for a higher D(π_k). This is because a balanced data distribution among the computational nodes is only a general constraint rather than the optimization goal in our algorithm. Another interesting finding is that D(π_k) is positively correlated with the size and average degree of the dataset. This can be explained by Eqs. (1) and (2).

Experiment 2: It is designed to evaluate the placing performance of the proposed algorithm for placing the three big graphs into the cloud P. Evaluation metrics include τ_p and the convergence speed. We replace MIMD in the first phase of the proposed algorithm with CB and Chunk, and denote the resulting variants as ''CB+P2'' and ''Chunk+P2'', respectively. They, along with BRGP and Chunk, are selected as comparisons. Experimental results are shown in Fig. 5. Due to

space limits, clueweb12 and uk-2007-05 are abbreviated to cw and uk, respectively, in Fig. 5(b). It can be seen from Fig. 5(a) that the proposed placement algorithm converges to a locally optimal solution faster than ''CB+P2'' and ''Chunk+P2'' but a little more slowly than BRGP. The superiority of our algorithm over ''CB+P2'' and ''Chunk+P2'' arises because the sckernel k-means algorithm is sensitive to initializations: the higher the quality of the initialization is, the faster it converges. As seen from Fig. 4(a), in the first phase of data placement, our proposed heuristic MIMD achieves the highest D(π_k) on all three big graphs, providing high-quality initializations for the second phase. Thus, it is no surprise that the proposed placement method converges quickly afterwards. This is also reflected in Fig. 5(b), where the slopes of the lines corresponding to the proposed algorithm are the largest. As expected, the lightweight placement algorithm Chunk spends the least time in placing all three big graphs. But we argue that this heuristic is too simple to obtain placement schemes that can sustain high-performance parallel processing of big graphs. This argument will be validated by Experiment 5 below. The reason why our proposed algorithm runs more slowly than BRGP is that the latter clusters big graphs at a coarser level of granularity and thus requires a smaller number of iterations. But this comes at the cost of reducing the quality of the placement scheme in terms of D(π_k) and λ_d, as shown in Fig. 4(a) and (b). We will show next that the placement schemes generated by this algorithm cannot support the parallel processing of big graphs well.

Experiment 3: It is designed to validate the effect of our pruning strategy in accelerating the execution of the second phase of the proposed algorithm. Evaluation metrics include the reduction of τ_p2 and the number of distance calculations at each iteration of the second phase. For comparison, we choose the sckernel k-means algorithm without the pruning strategy. Experimental results are shown in Fig. 6. From Fig. 6(a), we can see that the runtime of the sckernel k-means algorithm with the pruning strategy is significantly reduced compared to the case without the pruning strategy. Taking the dataset RMAT as an example, the runtime is reduced by 53% after using the pruning strategy. The reason is that the pruning strategy can greatly reduce the number of unnecessary distance calculations between vertices and community centers. This can also be seen from Fig. 6(b). Without the pruning strategy, the number of distance calculations at each iteration is constant. With the pruning strategy, this number decreases rapidly as the number of iterations increases, because more and more vertices are placed onto their optimal computational nodes as the iterations proceed. Although k-means-like approaches are known to be expensive, our pruning technique is shown to greatly reduce their cost. This technique makes the proposed algorithm suitable for big graph placement.


Fig. 4. Placing qualities of different algorithms for placing different big graphs in the cloud. (a) D(πk ). (b) λd . (c) ρd .

Fig. 5. The placing performance of different algorithms for placing different big graphs in the cloud. (a) τp . (b) convergence speed.

Fig. 6. The effect of our pruning strategy. (a) τp2 . (b) # kernel distance computation at each iteration.

Experiment 4: It is designed to evaluate the effect of the parallelization optimization of the second phase of the proposed algorithm. Evaluation metrics include the speedup sp, the efficiency ε, and the scalability ψ.

Fig. 7. The effect of our parallelization optimization. (a) and (b) are the measured speedup and efficiency of the parsckernel k-means algorithm for placing three big graphs when increasing the number of bare metal nodes used, respectively; (c) shows the scaleup behavior of the parsckernel k-means algorithm for placing a series of RMAT graphs with different scales generated by the R-MAT tool when increasing the number of bare metal nodes used. These graphs, which are not listed in Table 2, are of the same type, which makes it easy to keep the problem size per node fixed; (d) shows the scaleup behavior of the parsckernel k-means algorithm for placing these RMAT graphs when fixing the number of bare metal nodes used.

We run the parsckernel k-means algorithm on m = 1, 2, 3, ..., 14 bare metal nodes to place the three big graphs onto the cloud P for parallel processing. The speedup study measures the ratio of the execution time for placing a big graph into the cloud on one bare metal node to that for placing the same graph on m bare metal nodes. The efficiency study measures the efficiency by increasing the number of bare metal nodes while keeping the big graphs fixed. The scaleup study measures execution times by keeping the community scale per bare metal node fixed while increasing the number of bare metal nodes. Experimental results are shown in Fig. 7. From Fig. 7(a), we can see that the ideal speedup, represented by the line without markers, is not linear. This is because the computing capabilities of the bare metal nodes of the cloud P are different. The achieved speedups corresponding to the RMAT graph and the clueweb12 graph approximate the ideal one when increasing the number of bare metal nodes used, while that corresponding to the uk-2007-05 graph does not. This is because the scale of the uk-2007-05 graph is relatively small, and not every bare metal node is fed with enough data to fully utilize its computing capability. Thus, the efficiency corresponding to the uk-2007-05 graph is the lowest, which is reflected in Fig. 7(b). The scaleup result shown in Fig. 7(c) reports the execution time per iteration instead of the total execution time, to eliminate the impact of initialization and data scaling on the timing measurement. From

Fig. 7(c) we can see that the slope of the scaleup curve is small, and thus the parsckernel k-means algorithm has a good scaleup with respect to the scale of the dataset. The scaleup result shown in Fig. 7(d) reports the speedup of the parsckernel k-means algorithm when increasing the scales of the RMAT graphs and fixing the number of bare metal nodes used. We can see that the speedup of the parsckernel k-means algorithm starts to decline when the scale of the graph is larger than 10^9 and drops quickly after the scale of the graph reaches 10^10. The reason may be that the graph is too large to be stored in the distributed memory of the cloud when its scale is larger than 10^10, which inevitably introduces expensive disk I/O operations and degrades the performance.

Experiment 5: It is designed to evaluate how well the proposed algorithm supports parallel processing of big graphs in the cloud. The basic idea is to evaluate the performance of graph queries and analytics applications given the data placement scheme generated by the proposed algorithm. Evaluation metrics include the throughput (queries only), the turnaround time (graph algorithms only), the communication overhead, and the load balance ρ_w. For queries, this work selects the h-hop neighbor search query with h = 5 as an example and randomly generates 60,000 such queries per second as the graph workload. For graph algorithms, this work runs PageRank for 20 iterations and SSSP. For comparisons, we


Fig. 8. The performance of parallel processing of big graphs with placement schemes obtained by different algorithms in the cloud. (a)∼(c) are the throughput, communication overhead, and ρ_w of running the h-hop search query, respectively; (d)∼(f) are the turnaround time, communication overhead, and ρ_w of running PageRank, respectively; (g)∼(i) are the turnaround time, communication overhead, and ρ_w of running SSSP, respectively.

choose BRGP, MIMD, CB, and Chunk. Experimental results are shown in Fig. 8.

From Fig. 8, we can see that the placement scheme obtained by our proposed method provides the greatest support for running both graph queries and analytics applications on all three big graphs in terms of throughput, runtime, and communication overhead, at the cost of a tolerable load imbalance. Taking the dataset RMAT as an example, the throughput of running the h-hop search query on the placement scheme obtained by the proposed algorithm is 23%, 68%, 91%, and 209% higher than that by BRGP, MIMD, CB, and Chunk, respectively; the runtime of PageRank for 20 iterations on the placement scheme obtained by the proposed algorithm is 38%, 69%, 83%, and 218% smaller than that by BRGP, MIMD, CB, and Chunk, respectively; the runtime of SSSP on the placement scheme obtained by the proposed algorithm is 40%, 72%, 89%, and 227% smaller than that by BRGP, MIMD, CB, and Chunk, respectively. Similar results can also be found for the communication overhead. These results confirm our argument that Chunk, which is a state-of-the-art placement algorithm, cannot support high-performance parallel processing of big graphs well. The cost of the gains of our algorithm over Chunk is a 193% increase in the placement time, which is demonstrated to be worthwhile. These experimental results prove that, compared with the other methods, the proposed placement algorithm provides greater support for the parallel processing of both local graph queries and analytics applications on different datasets in the cloud. Thus, modularity density is a good objective to guide big graph placement for parallel processing in the cloud.


7. Conclusions

[16] D. Yuan, Y. Yang, X. Liu, J. Chen, A data placement strategy in scientific cloud workflows, Future Gener. Comput. Syst. 26 (8) (2010) 1200–1214.

In this work, we propose a two-phase community-aware placement algorithm named sckernel k-means++ to place a big graph into a cloud for parallel processing. A streaming partitioning heuristic for obtaining an initial placement scheme with relatively high modularity density is presented. A scale-constrained kernel k-means algorithm for obtaining a near-optimal placement scheme is introduced. A pruning strategy to reduce the algorithm complexity is proposed. A simple data-parallel parallelization strategy to accelerate the proposed algorithm is presented. Experiments show that our algorithm preserves better graph topologies, has higher placing performance, and thus support better big graph parallel processing in the cloud.


Acknowledgments

This work was supported by the National Social Science Foundation of China under grant No. 17BQT086; the Subproject of National Seafloor Observatory System of China under grant No. 2970000001/001/016; and the CCF Opening Project of Information System under grant No. CCFIS2018-01-03.

Declaration of competing interest

The authors declare that they have no conflicts of interest with respect to the authorship or publication of this article.



Kekun Hu received the M.S. degree in computer science from Shandong University of Science and Technology in 2014. He is currently working toward the Ph.D. degree in computer science at Tongji University. His research interests include parallel computing, big graph data analysis, and performance optimization.

Guosun Zeng received the B.S., M.S., and Ph.D. degrees in computer software and application all from the Department of Computer Science and Engineering, Shanghai Jiao Tong University. He is currently working at Tongji University as a full professor, and as a supervisor of Ph.D. candidates in computer software and theory. His research interests include green computing, parallel computing and information security. He is a senior member of the IEEE.