
Accepted Manuscript

A distributed overlapping community detection model for large graphs using autoencoder

Vandana Bhatia, Rinkle Rani

PII: S0167-739X(17)32786-3
DOI: https://doi.org/10.1016/j.future.2018.10.045
Reference: FUTURE 4550

To appear in: Future Generation Computer Systems

Received date: 4 December 2017
Revised date: 25 August 2018
Accepted date: 23 October 2018

Please cite this article as: V. Bhatia and R. Rani, A distributed overlapping community detection model for large graphs using autoencoder, Future Generation Computer Systems (2018), https://doi.org/10.1016/j.future.2018.10.045

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

A Distributed Overlapping Community Detection Model for Large Graphs using Autoencoder

Vandana Bhatia∗
Department of Computer Science and Engineering, Amity School of Engineering and Technology, Amity University, Noida, India

Rinkle Rani Department of Computer Science and Engineering Thapar Institute of Engineering and Technology, Patiala, India

Abstract

Community detection has become pervasive in finding similar patterns present in a network. It aims to discover a lower-dimensional embedding that represents the structure of the network. Many real-life networks comprise overlapping communities and have non-linear features. Despite having great potential for analyzing network structure, the existing approaches provide limited support and find disjoint communities only. As data is growing at an unprecedented rate, scalable and intelligent solutions are needed for identifying similar patterns. Motivated by the robust representation ability of deep neural network based autoencoders, we propose a learning model named 'DeCom' for finding overlapping communities in large networks. DeCom uses an autoencoder based layered approach to initialize candidate seed nodes and to determine the number of communities by considering the network structure. The selected seed nodes and the formed clusters are refined in the last layer by minimizing the reconstruction error using modularity. The performance of DeCom is compared with three state-of-the-art clustering algorithms on real-life networks. It is observed that the felicitous selection of seed nodes reduces the number of iterations. The experimental results reveal that the proposed DeCom scales up linearly to handle large graphs and produces better-quality clusters than the other state-of-the-art clustering algorithms.

Keywords: Overlapping Communities, Autoencoder, Large Networks, Pregel

1. Introduction

Representing data in the form of a graph offers a very powerful way to provide primitive representations in many applications, spanning from biological networks and web networks to social networks. In the current era of big data, the size of graphs is increasing exponentially. Community detection, or clustering, in a large network can provide a good understanding of the structure of the network and yield useful insights.

∗Corresponding author.
Email addresses: [email protected] (Vandana Bhatia), [email protected] (Rinkle Rani)


Communities are groups of vertices that are more densely connected to each other than to the vertices of other communities [26]. Community detection is used in a wide variety of applications such as healthcare, social networks [3, 41], and collaboration networks. Essentially, communities provide a better representation of the structure of the network in a lower-dimensional latent space.

Many community detection algorithms have been proposed in the literature. Most of them capture only the linear properties of the network, whereas relationships among the vertices of real-life networks may be non-linear. A deep neural network based autoencoder is efficient at providing lower-dimensional embeddings that can capture non-linear properties, and it is well suited to tasks that map data points into lower-dimensional spaces, such as community detection [36, 33]. Some recent research shows that K-means and spectral clustering are suitable for lower-dimensional data representations learned by neural network based autoencoders [34]. Although these existing algorithms find communities using deep structures, they find only disjoint communities and do not analyze the structure of the graph. In addition, for large graphs, the structure of the graph plays an important role in determining the size of the clusters. In many real-world applications such as social and biological networks, it is necessary to allow overlap among the clusters, as data points may inherently belong to more than one cluster [20]. Further, most existing algorithms are sensitive to the selection of the initial cluster centers and require the number of clusters to be specified in advance. If the number of clusters in such graphs is predefined, the actual clusters may break up or get merged; clusters may even be created when the data cannot meaningfully be clustered. The existing work also lacks scalability. Since the size of networks is increasing at an exponential rate, scalable algorithms are in high demand.

In this paper, a scalable community detection approach named 'DeCom' is proposed that leverages ideas from autoencoder pipelines to identify overlapping and non-overlapping clusters in network structures. DeCom finds the number of clusters by analyzing the structure of the network. The clusters are formed by restructuring the graph into linear structures using a random walk based personalized PageRank algorithm. Further, the proposed model minimizes the reconstruction error using modularity. Experimental results over six large real-life networks show the efficiency of the proposed approach. The proposed model is similar to an autoencoder in its ability to learn useful representations of the original data in a lower-dimensional space, making the community detection task easier to accomplish. The results reveal that the proposed model is highly scalable and produces better-quality clusters than the competing algorithms. A comparative summary of existing autoencoder based community detection algorithms and the proposed algorithm is shown in Table 1.


Table 1: Characteristic summary of deep neural network based community detection algorithms

Approach            Analyzes graph   Overlapping       No. of clusters   Handles large   Scalable
                    structure        among clusters    predefined        graphs
GraphEncoder [34]   No               No                Yes               No              No
DARG [14]           No               No                Yes               No              No
DEC [38]            No               No                Yes               No              No
Song et al. [31]    No               No                Yes               No              No
DeepWalk [7]        Yes              No                Yes               No              No
Proposed DeCom      Yes              Yes               No                Yes             Yes

The rest of the paper is structured as follows. Section 2 explains the background and related work. The proposed model 'DeCom' is described in Section 3. The results of the performed experiments are shown in Section 4. Section 5 presents the conclusion with recommendations for further research.

2. Background and Related Work

In this section, we discuss the existing research and the various techniques used in the paper. Given a graph or network G = {V, E}, where V is the set of N nodes and E is the set of edges, the proposed model performs deep neural network based learning using an autoencoder to find overlapping communities in large networks. Further, it minimizes the reconstruction error using modularity to optimize the detected communities.

2.1. Community Detection in Graphs

A community C = (V_c, E_c) is a subgraph of G whose members share some mutual properties in the network. A community can be disjoint or overlapping, as shown in Fig. 1. A good community is a subgraph C_i such that for each vertex v_i ∈ C_i we maximize the intra-cluster edges and minimize the inter-cluster edges, as shown in Fig. 1(b). The term community was first widely used by Newman and Girvan [26], who detected communities by performing natural divisions of nodes into densely connected subgroups in the network. Later, many community detection algorithms were proposed [2]. Throughout the paper, the terms community and cluster are used interchangeably.

Many overlapping or fuzzy clustering algorithms have been proposed in the literature [41, 10, 13]. Fuzzy C-means (FCM) is one of the most popular fuzzy clustering algorithms, introduced by Bezdek [5]. However, it suffers from the problem of local minima due to the random initialization of cluster heads. To overcome this issue, many other clustering approaches were proposed [12]. An overlapping community detection algorithm named 'OCDDP' based on density peaks was proposed in [1]; it utilizes a similarity-based method to set distances among nodes.

Fig. 1. (a) A graph G with 30 nodes (b) Detection of four communities in graph G

Wang et al. proposed a community detection method for complex networks based on a locally structural similarity measure, without any prior knowledge about the number of clusters [35]. Another popular clustering approach in the literature is the Label Propagation Algorithm (LPA), based on the concept of information diffusion in a network. In LPA, each node is initially assigned a unique label, and the neighborhood of each node is then inspected. Dynamic algorithms based on LPA for finding overlapping communities, such as COPRA [11] and the speaker-listener label propagation algorithm (SLPA) [40], have been proposed. In COPRA, the label of a vertex carries information about more than one community, which is propagated among the neighboring vertices. SLPA spreads labels among nodes during iterations and stores previous label information for each node, such that the distribution of labels follows specific speaking and listening rules. Parallel overlapping community detection based on SLPA has also been proposed in the literature [17]. However, both COPRA and SLPA require a parameter to limit the number of communities to which a node can belong [39].

Some scalable solutions have also been proposed recently. Ludwig developed a scalable algorithm, MR-FCM [21], by implementing FCM on the Map-Reduce paradigm. A label propagation algorithm for large data sets using the Map-Reduce programming model was also introduced in recent work [16]. However, the Map-Reduce paradigm is not efficient for handling large networks. Pregel [22] was introduced for handling large graphs in a distributed environment; its detailed structure is described later in this section. Certain recent work has also used Pregel based models for graph clustering. Zhang and Ge [42] proposed a parallel algorithm based on the Bulk Synchronous Parallel (BSP) model for finding overlapping community structure in directed and weighted complex networks. The algorithm iteratively merges communities of similar nature (n−1) times, obtains n schemes, and selects the scheme that yields the largest modularity as the best scheme.

2.2. Seed Selection Approaches

For identifying accurate clusters in a network, the selection of seeds or cluster heads plays a vital role. Strategies for finding appropriate seed vertices include degree centrality,

edge-betweenness, local minimal neighborhoods, etc. [6]. However, if considered solely, they all neglect the distance between the seed and the rest of the vertices of the cluster.

2.3. Modularity Optimization

The modularity of a graph G is a measure of the quality of the clusters formed within that graph [25]. A graph with large modularity has more intra-cluster edges and fewer inter-cluster edges between the various clusters. The objective function for calculating the modularity of hard clusters is:

$$Q = \frac{1}{2m} \sum_{(u,v) \in V} \left( a_{uv} - \frac{\deg(u)\,\deg(v)}{2m} \right) C_{u,v} \qquad (1)$$

where $a_{uv}$ is the element of the adjacency matrix M, m is the total number of edges, deg(u) and deg(v) are the degrees of vertices u and v, and $C_{u,v}$ indicates whether u and v belong to the same cluster:

$$C_{u,v} = \begin{cases} 1, & \text{if } (u,v) \in C_i \\ 0, & \text{otherwise} \end{cases}$$

Equation (1) reveals the characteristic tradeoff: many edges should be confined within clusters to maximize the first term, while the second term is minimized by splitting the graph G into clusters with small total degrees, i.e., few inter-cluster edges. The second term gives the probability of the existence of an edge between two vertices. Better clustering results can be achieved by optimizing the value of modularity.

For calculating the modularity of overlapping communities, Nepusz [24] introduced a measure of cluster quality. The membership value $\mu_{ki}$ can be considered the probability that vertex $v_i$ is a member of cluster k. The probability that vertices u and v belong to the same cluster is the dot product of their membership values, yielding a similarity measure $s_{uv} = \sum_{k=1}^{C} \mu_{ku}\,\mu_{kv}$. It can be used in the fuzzy variant of modularity:

$$Q_{fuzzy} = \frac{1}{2m} \sum_{(u,v) \in V} \left( a_{uv} - \frac{\deg(u)\,\deg(v)}{2m} \right) s_{u,v} \qquad (2)$$

In the case of hard clustering, the membership probability $\mu_{kv}$ is 1, and thus $Q_{fuzzy}$ takes the same value as Q for each vertex v.
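To make equations (1) and (2) concrete, the following minimal Python sketch (our own illustration, not the paper's implementation) computes the fuzzy modularity of a membership assignment; with a 0/1 hard membership matrix it reduces to the hard modularity Q.

import numpy as np

def fuzzy_modularity(A, U):
    """Fuzzy modularity Q_fuzzy of equation (2).

    A : (N, N) symmetric adjacency matrix.
    U : (C, N) membership matrix; U[k, i] is the membership of vertex i
        in cluster k. A 0/1 hard assignment reduces Q_fuzzy to the hard
        modularity Q of equation (1).
    """
    m = A.sum() / 2.0                       # total number of edges
    deg = A.sum(axis=1)                     # vertex degrees
    S = U.T @ U                             # s_uv = sum_k mu_ku * mu_kv
    B = A - np.outer(deg, deg) / (2.0 * m)  # a_uv - deg(u)deg(v)/2m
    return (B * S).sum() / (2.0 * m)

# Hard assignment example: a path 0-1-2-3 split into two clusters
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
U = np.array([[1, 1, 0, 0],                 # cluster 0: vertices 0, 1
              [0, 0, 1, 1]], dtype=float)   # cluster 1: vertices 2, 3
print(fuzzy_modularity(A, U))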

2.4. Autoencoder for Community Detection

An autoencoder is a special kind of deep neural network that learns in multiple layers by approximating the original graph data to provide a better representation. It is used for performing unsupervised learning in deep architectures and plays a vital role in transfer learning and other tasks [36]. Fig. 2 shows the high-level abstraction of an autoencoder based on multiple hidden layers. An autoencoder is trained to reconstruct its own input in order to find a more informative version of it.

Fig. 2. Autoencoder in deep Neural Network

Consider the inputs $x_i \in \mathbb{R}^n$. The autoencoder tries to learn a function f(x) such that the final function at the last layer is:

$$y_i = g(f(x_i)) \in \mathbb{R}^n \qquad (3)$$

Ample literature on deep neural networks and autoencoders is available [36, 18, 33]. Deng et al. [9] designed the Deep Stacking Network (DSN), a parallelized batch-mode deep architecture that involves numerous specialized neural network modules with only one hidden layer; instead of the input units, the raw data vector is concatenated with the output layer of the lower modules. Hutchinson et al. [15] proposed the Tensor Deep Stacking Network (T-DSN), a deep architecture consisting of multiple stacked blocks with two hidden layers. A GPU-based framework [28] was designed for parallelizing unsupervised learning models, including deep belief networks. Some deep methods have recently been proposed for community detection [34, 29, 32]. Due to the rapid growth of data, deep neural networks have also been used on distributed platforms. Zhang and Chen [43] presented a distributed deep belief network based on Map-Reduce, exploiting data-level parallelism; in their work, several levels of distributed Restricted Boltzmann Machines are stacked, and a distributed back-propagation algorithm is used for the fine-tuning. In most of the existing work, back-propagation is used for training the network. However, for large databases, fine-tuning with back-propagation suffers from large computational expense [4, 18]. To deal with this challenge, the proposed model DeCom updates the values of vertices over multiple iterations to minimize the reconstruction error.

2.5. Learning from Large Networks

Although deep neural networks have been adopted by many applications and have provided efficient outcomes, their training is still a non-trivial task. It is tremendously difficult to parallelize the iterative computations of deep neural network algorithms. But in recent years, with the

Fig. 3. Working Model of Giraph

unprecedented growth of digital data in various domains, deep models can be trained well. To handle such large amounts of data and to perform computations on them, there is a surge of interest in designing parallel, efficient, and scalable algorithms to train deep neural network models. One popular distributed framework is Map-Reduce, introduced by Dean and Ghemawat [8]. However, the native Map-Reduce model is not very efficient for graph processing because of its expensive disk write operations and large amounts of unnecessary data movement. For handling large distributed graphs, Pregel [22] was introduced, which performs in-memory vertex-centric computation using the bulk synchronous parallel model by dividing the processing into supersteps and substeps. It exploits fine-grained parallelism at the node level because, for graph operations, keeping data local is imperative. Pregel enables fast random access and the reuse of intermediate results for iterative graph algorithms. DeCom uses the vertex-centric Pregel model to learn the structure of the graph layer by layer.

Pregel is based on the Bulk Synchronous Parallel (BSP) paradigm and works in a distributed manner by dividing the vertices of a large graph among the workers in the cluster. It performs computations individually at each worker node. Communication among the workers occurs by passing messages that are generally small in comparison to the data and are thus more efficient to transmit. The algorithm terminates when no messages are produced during an iteration or when every worker node votes to halt. Pregel is designed for parallel algorithms with two basic operations, communication via message passing and barrier synchronization, as shown in Fig. 3. Messages from the previous superstep are available in the next superstep. In this paper, the iterative BSP based graph processing framework Giraph is used to perform offline batch processing of semi-structured graph data [23]. Giraph works as a sequence of supersteps and performs iterative calculations on top of a Hadoop cluster, avoiding costly disk and network operations by using in-memory and out-of-core execution. Each cycle of an iterative Giraph calculation on Hadoop runs as a full map-reduce job with only mappers.
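As an illustration of the vertex-centric BSP style described above, the following minimal single-machine Python sketch emulates a Pregel-like superstep loop; the function and variable names are our own and only mimic the structure of Giraph's API, not its actual Java interface.

from collections import defaultdict

def run_supersteps(vertices, compute, max_supersteps=30):
    """Emulate a Pregel/BSP loop: in each superstep every active vertex
    processes its inbox, may update state, and emits messages; a barrier
    then delivers the messages for the next superstep."""
    state = {v: None for v in vertices}
    inbox = defaultdict(list)
    active = set(vertices)
    for superstep in range(max_supersteps):
        outbox = defaultdict(list)
        for v in list(active):
            # compute() appends (target, message) pairs to outbox and
            # returns True when the vertex votes to halt
            if compute(v, superstep, state, inbox[v], outbox):
                active.discard(v)
        inbox = outbox                              # barrier synchronization
        active |= {v for v in inbox if inbox[v]}    # messages reactivate vertices
        if not active:
            break                                   # all halted, no messages: terminate
    return state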


Fig. 4. Proposed DeCom Model

3. The Proposed Model

The proposed DeCom model is an autoencoder based parallel overlapping community detection model designed for large graphs using the BSP based Pregel framework. The proposed work can be formalized as follows.

Problem Definition: Partition a graph into k overlapping communities C = (C_1, C_2, ..., C_k) in l layers, such that the edge cut size between clusters, i.e., the communication cost, is minimized in the distributed environment.

DeCom uses the Pregel framework for parallel processing. Pregel exploits learning through multiple layers and is therefore suitable for performing community detection on graphs in an unsupervised manner. In unsupervised learning, the information is sent to a machine in the hope that it will learn something new for which it was not trained. For performing unsupervised community detection with a deep neural network, autoencoder based learning is used in DeCom. As shown in Fig. 3, Pregel is appropriate for autoencoder based learning as it performs computations in multiple layers by using the output of one superstep as the input for the next superstep. The vertices of the graph can be viewed as neurons, and the message communication among vertices can be interpreted as synapses between neurons. The Pregel framework is well suited to running recursive algorithms over multiple supersteps.

Deep neural network based stacked autoencoders are the building blocks of DeCom. The model works in three phases, as shown in Fig. 4. In the first phase, for a given graph G = {V, E}, the seed nodes are selected. In the second phase, the communities are formed using personalized PageRank starting from each of the selected seed nodes. In the third phase, the reconstruction error is minimized using modularity.

All operations are performed using autoencoders. Let the autoencoder module for DeCom be DA, comprising seed node identification, community expansion, optimization, and the assignment of vertices to clusters.


Let c be the desired number of clusters for each layer $l_i$ of DA, and let k be the final number of clusters. Let the layers of DA be $l_1, l_2, \ldots, l_x$. For each layer, the input $X_l$ is computed using:

$$X_l = DA_l(X_{l-1} - m_l) \qquad (4)$$

where $m_l = \frac{\text{Number of Connections}}{2}$. Let $x_i$ be the input vector of the input layer, $h_i$ the hidden vector of the hidden layer, and $f_i$ the output vector of the output layer. Consider $H_o$ and $F_o$ as the activations of the hidden and output layers respectively. The hidden vector $h_i$ and output vector $f_i$ are computed using:

$$h_i = H_o(W x_i + a) \quad \text{and} \quad f_i = F_o(X h_i + b) \qquad (5)$$

The set of parameters $\{\theta_1, \theta_2\} = \{W, a, X, b\}$ is to be learned in the layers. The goal is to minimize the difference between the reconstruction $f_i$ and the input $x_i$ passing through the hidden embedding $h_i$. The reconstruction error is computed as:

$$Loss(\theta) = \min_\theta \sum_{i=1}^{N} \| f_i - x_i \|^2 \qquad (6)$$

The activation functions can be sigmoid or tanh.
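A minimal NumPy sketch of one autoencoder layer as defined by equations (5) and (6) is shown below. The sigmoid activation and the toy dimensions are illustrative assumptions; since DeCom deliberately avoids back-propagation (see Section 3.2), only the forward pass and the reconstruction loss are shown.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W, a, X, b):
    """One autoencoder layer, equation (5):
    h = H_o(W x + a) encodes, f = F_o(X h + b) reconstructs."""
    h = sigmoid(W @ x + a)      # hidden embedding h_i
    f = sigmoid(X @ h + b)      # reconstruction f_i
    return h, f

def reconstruction_loss(xs, W, a, X, b):
    """Equation (6): sum of squared reconstruction errors."""
    return sum(np.sum((forward(x, W, a, X, b)[1] - x) ** 2) for x in xs)

# toy dimensions: encode 6-dimensional degree rows into 3 dimensions
rng = np.random.default_rng(0)
n, p = 6, 3
W, a = rng.normal(size=(p, n)), np.zeros(p)
X, b = rng.normal(size=(n, p)), np.zeros(n)
xs = [rng.random(n) for _ in range(4)]
print(reconstruction_loss(xs, W, a, X, b))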

The training process is iterative, and during each iteration two tasks are performed: a vertex value update and a merge update. During the vertex value update task, each worker first updates the values of the vertices in its local graph according to the messages received from the master node, then performs local computation to alter the vertex values and labels in its assigned data partitions, and finally sends the updated vertex values to the master node. In the merge update task, the master node updates the values and labels of the vertices according to the messages received from the worker nodes, and then distributes the updated data to all the worker nodes. The two tasks are repeated alternately until the specified number of iterations has been performed or a termination criterion is met. During each iteration, the computation is performed on each worker node and the final aggregation is performed on the master node. The phase-wise functioning of DeCom is shown in Fig. 4. The implementation details of each phase are given below.

3.1. Phase 1: Seed Node Identification

In the first phase, the characteristics of the vertices are learned. The learning process starts with the initialization of the variables. This phase primarily involves greedy layer-wise training. Consider a graph G having N nodes and a degree matrix D. We treat D as the training set containing N instances $d(v_1), d(v_2), \ldots, d(v_N)$ such that $d(v_i) \in \mathbb{R}^N$.
From the N vertices of the graph, c vertices are selected as seed nodes. Here, a basic autoencoder is embedded by setting the target values V′ to be equivalent to the input values V. The autoencoder tries to learn the values associated with the vertices: the encoder maps the degree matrix D to a lower-dimensional matrix $C = [c_{ij}] \in \mathbb{R}^{c \times N}$ representing the candidate seed nodes.
Algorithm 1: Phase 1: Seed Node Identification (DeCom)
Data: Graph G with edges (v_i, v_j)
Result: Candidate seeds

Function VertexCompute(vertex, Msg):
    if superstep = 0 then
        Value(vertex) ← NumEdges
        SendMsg → Edges(vertex, 0)
    else if superstep = 1 then
        InDegree = 0
        for Msg ∈ Messages do
            InDegree = InDegree + 1
        end
        OutDegree = Value(vertex)
        Degree = InDegree + OutDegree
        Value(vertex) = Degree
    else if superstep = 2 then
        while v ← Msg_i do
            sort(v.list, Msg_i)    // sort vertices according to degree value
        end
    end

A good seeding approach should select well-distributed seeds from the network, such that after the expansion of the clusters they have high coverage. A greedy and effective layer-by-layer approach is used in the proposed deep neural network based DeCom model to learn, in the hidden layers, the influence of each vertex on the network. A set of c vertices is selected greedily by examining vertices in decreasing order of degree. The degree of vertex $v_i$ is defined as $d_i = \sum_j e_{ij}$. It is expected that vertices with high degree have shorter distances to the other vertices inside a community.

It has also been observed in many networks that the neighbors of highly connected vertices themselves have high degrees. Thus, for this heuristic, the vertices adjacent to a selected seed vertex are not considered as candidate seeds. In the case where two or more adjacent vertices have the same degree, one of the vertices is selected randomly and the rest are discarded from the candidate seed list. Therefore, in this phase, the parameters to be learned are the candidate seed nodes $c_i$, and the optimization is performed by not letting two adjacent vertices both become candidate seed nodes. A minimal sketch of this selection heuristic follows.
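The following Python sketch illustrates the greedy degree-based selection with the adjacent-vertex exclusion rule; the function name and the adjacency-dict graph representation are our own illustrative choices, and ties are broken deterministically here rather than randomly as in the paper.

def select_seeds(adj, c):
    """Greedily pick up to c seed vertices in decreasing order of degree,
    skipping any vertex adjacent to an already chosen seed."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    seeds, blocked = [], set()
    for v in sorted(adj, key=degree.get, reverse=True):
        if v in blocked:
            continue                 # adjacent to a chosen seed: not a candidate
        seeds.append(v)
        blocked.update(adj[v])       # exclude the seed's neighbors
        if len(seeds) == c:
            break
    return seeds

# toy graph: vertex 0 is a hub, vertices 4 and 5 form a separate pair
adj = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}, 4: {5}, 5: {4}}
print(select_seeds(adj, 2))          # [0, 4]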

3.2. Phase 2: Learning Topology

The goal of this phase is to obtain multiple communities in a completely unsupervised manner, each defined by a seed node $c_i$. Once the list of seed nodes is ready, the process of community detection can start.

The communities are expanded from each selected seed node using the personalized PageRank algorithm. The algorithm starts a lazy random walk from seed node $c_i$ and visits a neighbor vertex $v_i$ randomly with probability $\frac{1}{2}(1-\alpha)$, or goes back to $c_i$ with probability $\alpha$.

The personalized PageRank transition matrix $P_s$ is given by the following equation:

$$P_s = (1-\alpha)\,c_i + \alpha M \qquad (7)$$

where M is the adjacency list. The personalized PageRank parameter $\alpha$ is set to 0.85 [22]. The detailed steps are explained in Algorithm 2.

A community includes the seed node and several other vertices. The prominent center spreads its influence to its neighbor vertices over multiple iterations; the vertices that are closer to the seed node also have high influence. The cluster center $c_i$ transmits its impact, with label information, to its adjacent neighboring vertices. The neighbors then pass their impact on to their adjacent vertices in further iterations. During each iteration, the vertices classify the received messages according to their labels; each vertex then creates new messages and sends them to its neighbors. The process continues until the specified number of supersteps has been executed. Finally, the sum of the messages carrying the same label a is calculated by adding the number of edges in a random walk $(C_a, v_{a1}, v_{a2}, \ldots, v_{ap})$ using $Sum_a = \sum_{i=1}^{m} Edge_a$, where m is the number of messages with label a and $Edge_a$ is an edge carrying a message with label a. $Sum_a$ is computed to determine the number of vertices in a community.

The time needed to detect a community is larger for big communities than for small ones. Let $\pi_s$ be the stationary distribution of $P_s$. The vertices are allotted to communities using:

$$Community[C_i] = \arg\max_{s \in K} \pi_s(v) \qquad (8)$$

Here, s is the seed node and $\pi_s(v)$ scores the vertices in the personalized PageRank. Being an unsupervised learning model for large graphs with millions of vertices and edges, DeCom skips the costly back-propagation method [30] in the learning topology phase and simply passes the output of one layer to the next. This makes the whole training strategy hybrid, providing generative performance while also improving the discriminative capability of the network. Thus, in this phase, the encoder transforms the input graph into the lower-dimensional matrix $P_s$, and the parameters $\theta_i$ are learned by analyzing the stationary distribution $\pi_s$.
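For concreteness, a minimal Python sketch of seeded (personalized) PageRank by power iteration, together with the assignment rule of equation (8), is shown below; the dense-matrix formulation and function names are our own simplification of the message-passing version in Algorithm 2.

import numpy as np

def personalized_pagerank(A, seed, alpha=0.85, iters=40):
    """Stationary distribution pi_s of a random walk that follows an
    out-edge with probability alpha and teleports back to the seed with
    probability 1 - alpha (cf. equation (7), alpha = 0.85)."""
    n = A.shape[0]
    M = A / np.maximum(A.sum(axis=0, keepdims=True), 1)  # column-stochastic
    e = np.zeros(n); e[seed] = 1.0                       # restart vector c_i
    pi = np.full(n, 1.0 / n)
    for _ in range(iters):
        pi = (1 - alpha) * e + alpha * (M @ pi)
    return pi

def assign_communities(A, seeds):
    """Equation (8): each vertex joins the seed whose personalized
    PageRank gives it the highest score."""
    scores = np.vstack([personalized_pagerank(A, s) for s in seeds])
    return scores.argmax(axis=0)        # index into seeds, per vertex

# two triangles joined by a single edge
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
print(assign_communities(A, seeds=[0, 5]))   # [0 0 0 1 1 1]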

3.3. Phase 3: Learning Refinements

The formed clusters are refined using modularity optimization. Modularity maximization can be intuitively interpreted as finding better representations by reconstructing the network structure. Further, the seed nodes are reselected based on the distances between the vertices of a community. It should be noted that both optimization steps reconstruct the original network by linear reconstruction only.

Algorithm 2: Phase 2: Learning Topology
Data: Selected seed nodes
Result: Formed communities

Function Compute(Msg, superstep):
    while superstep ≤ 40 do
        superstep++
        while v ← Msg_i do
            if v_i == c_i then
                Sum_i = 0
                value = 0.15 * v_i + 0.85 * Sum_i
            else
                value = 0.85 * v_i * Sum_i
            end
            v.label = label_i
        end
        sendMsg(value)
    end
    while superstep ≥ 40 do
        while v ← Msg_i do
            voteToHalt()
            sendMsg(Communities)
            Aggregate(vertices, seed nodes)
        end
    end

Algorithm 3: Phase 3: Learning Refinements
Data: Formed communities C[k]
Result: Final seed nodes

while superstep ≤ 70 do
    for each c_i ∈ C_k do
        while Modularity(c_i) ≤ threshold do
            vertices(c_i) → neighbours(c_i)
        end
    end
    while C_k ≠ ∅ do
        for each v_i ∈ C_k do
            minDistance = min(message)
            sendMessage(C_k, minDistance)
            newC_k = C_j    // using equation (11)
        end
    end
    Aggregate(C_j, Vertices)
end


(a) Optimization of Communities. The formed clusters are assessed by the fuzzification of the modularity function, $Q_{fuzzy}$, from equation (2). A community with low modularity tends to have more edges outside the community than within it. The conflict over the belongingness of vertices in low-modularity communities to neighboring clusters is solved by allocating such vertices to the neighboring communities in which they have an outgoing edge. This is accomplished iteratively by ranking the generated communities according to their fuzzy modularity value $Q_{fuzzy}$. The master iterates through the communities and adds those with high fuzzy modularity to the final list of communities. When a cluster q with low modularity reaches the master, the master instructs the workers to allocate the vertices $(v_1, v_2, \ldots, v_n)$ of that community to the other neighboring communities $(c_1, c_2, \ldots, c_n)$ to which those vertices are connected. The workers re-assign all such vertices to neighboring clusters and report the new list of updated clusters to the master. Clusters with lower modularity thus vanish and are removed from the cluster list.

Formally, we can represent the modularity optimization phase in vector form. Let the modularity matrix be $D = [q_{ij}] \in \mathbb{R}^{N \times N}$, where $q_{ij} = a_{ij} - \frac{\deg(i)\,\deg(j)}{m}$. The modularity matrix D is the input of the corresponding autoencoder layer. The encoder maps the data D to a lower-dimensional matrix $B = [b_{ij}] \in \mathbb{R}^{p \times N}$, where $p < N$, such that the ith column equals:

$$b_i = f(q_i) = H_o(W_B\, q_i + p_H) \qquad (9)$$

where $W_B \in \mathbb{R}^{p \times N}$ and $p_H \in \mathbb{R}^{p \times 1}$ are the parameters to be learned. The parameters are learned from the change in modularity, $\Delta Q_{fuzzy}$, of the clusters to which vertices are added. It is given as:

$$\Delta Q_{fuzzy} = \frac{1}{m}\left[\left(\sum_{j \in p} a_{vj} - \deg(v)\,\frac{\sum_{j \in p}\deg(j)}{2m}\right) - \left(\sum_{j \in q} a_{vj} - \deg(v)\,\frac{\sum_{j \in q}\deg(j)}{2m}\right)\right] \qquad (10)$$

The activation function $H_o$ is an element-wise non-linear mapping. The decoder converts the latent representation $B = [b_{ij}] \in \mathbb{R}^{p \times N}$ back to the original data form $D = [q_{ij}] \in \mathbb{R}^{N \times N}$.
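A small Python sketch of the modularity gain of equation (10), i.e., the change in modularity when vertex v moves from community q to community p, is given below; the function signature and the unweighted-adjacency assumption are our own.

import numpy as np

def modularity_gain(A, deg, m, v, p, q):
    """Equation (10): gain from moving vertex v from community q
    (a list of vertex indices) to community p."""
    def term(comm):
        edges_to_comm = A[v, comm].sum()            # sum of a_vj for j in comm
        return edges_to_comm - deg[v] * deg[comm].sum() / (2.0 * m)
    return (term(np.array(p)) - term(np.array(q))) / m

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
deg = A.sum(axis=1); m = A.sum() / 2
# moving vertex 3 from the singleton {3} into the triangle {0, 1, 2}
print(modularity_gain(A, deg, m, v=3, p=[0, 1, 2], q=[3]))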
Table 2: Dataset description

Source        Number of Vertices   Number of Edges   Type         Layer Configuration
Facebook      4,039                88,234            Undirected   4039-403-380
Twitter       81,306               1,768,149         Directed     81306-8130-7891
Amazon        334,863              925,872           Undirected   334863-83486-78291
GPlus         107,614              13,673,453        Directed     107614-12639-7392
LiveJournal   3,997,962            34,681,189        Undirected   3997962-399796-290187
Orkut         3,072,441            117,185,083       Undirected   3072441-894753-787145

The parameters of the encoder, {W, a}, are learned by computing the distances between the vertices in a cluster. To better estimate the impact of the seed node on all the vertices of a community, the Manhattan distance is used, because it is not a squared function and is therefore less sensitive to noise. The Manhattan distance is calculated as $d(v_i, v_j) = \sum_{k=1}^{C} |v_{ik} - v_{jk}|$. The master node holds the aggregated list of all the communities with their assigned vertices. In the next superstep, the worker nodes receive this list from the master and are instructed to calculate the distance of each vertex $v_i$ from $v_j$ in the cluster $c_v$. Vertices within the same community communicate by passing messages telling the others their distance from the seed vertex. Worker nodes use combiners to minimize the number of messages, so each vertex needs to evaluate only part of the received messages. In further iterations, the worker nodes calculate the new seed vertex of each community as the mean of the n vertices allocated to community $c_i$:

$$c_i = \frac{1}{n}\sum_{j=1}^{n} v_j \qquad (11)$$
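The following sketch illustrates the re-seeding step under our reading of equation (11): the new seed is the member vertex closest, in Manhattan distance, to the mean of the community's vectors (since the mean itself need not be a vertex). The feature-vector representation is our assumption.

import numpy as np

def manhattan(a, b):
    """Manhattan distance sum_k |a_k - b_k|: not a squared function,
    hence less sensitive to noise than Euclidean distance."""
    return np.abs(a - b).sum()

def reselect_seed(members, features):
    """Equation (11): average the n member vectors, then return the
    member vertex nearest to that mean as the new seed."""
    mean = np.mean([features[v] for v in members], axis=0)
    return min(members, key=lambda v: manhattan(features[v], mean))

features = {0: np.array([0.0, 0.0]), 1: np.array([1.0, 0.0]),
            2: np.array([0.0, 1.0]), 3: np.array([5.0, 5.0])}
print(reselect_seed([0, 1, 2, 3], features))   # vertex nearest the mean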

The decoder reproduces the original graph with the formed communities. The value of each vertex is reconstructed by associating it with its constituent clusters.

3.4. Complexity Analysis of DeCom

Let n be the number of vertices in the network and z the number of edges. Let n′ be the average number of vertices in one community. The time complexity of each step of the proposed DeCom model is given below:

• Initialization of n vertices in the l-layered autoencoder takes O(l·n) time.
• Selecting seed nodes takes O(z) for degree computation and O(n log n) for sorting the degrees.
• Formation of k communities having on average O(n′) nodes takes O(n + z) for personalized PageRank and O(k log n′) for modularity optimization.
• Selecting the final cluster heads takes O(k·n′) time.

4. Experimental Evaluation

A series of experiments was performed to quantify the efficiency of DeCom in comparison to the Label Propagation Algorithm (LPA), SLPA, FCM, and Pregel's Semi-Clustering (S-Cl).

Fig. 5. Number of Supersteps vs Run Time on (a) Facebook (b) Twitter (c) Amazon (d) GooglePlus (e) LiveJournal (f) Orkut

The aforementioned algorithms are considered because they can be efficiently employed in the same distributed environment as DeCom. The number of communities must be predefined in these algorithms; hence, we used the number of communities computed by DeCom for the comparative analysis. We also considered a variant in which the seed nodes are identified randomly in Phase 1 of DeCom, named 'R-Seeds'; comparing DeCom with R-Seeds isolates the effect of degree-based seed selection.

The experiments were performed on a Hadoop cluster of 8 Dell machines, each with 8 GB of RAM and a 1024 GB hard disk, running Ubuntu Linux. All the machines were connected via a gigabit network. One node served as the master node and the rest as slave nodes. We used Hadoop 2.6.0 and Giraph 1.1.0 for our experiments, over six real-life network datasets, to illustrate the strengths and weaknesses of DeCom. Table 2 lists the datasets of large networks used in the experiments, taken from the Stanford Large Network Dataset Collection [19].

Three stacked autoencoders are used, one for each phase, such that the dimension of the latent space of each encoder is less than that of its input and output. The layer configurations for all the networks are given in Table 2. We considered the degree matrix as the input to the first autoencoder and trained it to find the candidate seed nodes; the embedding result is taken as the input to the second autoencoder, and so on. The autoencoders are trained separately. The maximum number of iterations performed on each autoencoder is 30. For the evaluation of FCM [37], the fuzzification index m is set to 2 and the epsilon ε to 1e−3 [27].


Fig. 6. Phase-Wise Run-Time of DeCom

4.1. Run Time and Number of Supersteps

The run time of an algorithm is the time taken by the system to execute it. The total run time of the proposed model includes both the time spent loading the graph into Giraph and the time to run DeCom. The communication overhead is reduced by using Giraph's combiners, which aggregate messages prior to transfer in each superstep. Combiners in DeCom aggregate all the PageRank values received from a worker in a superstep and forward them as a single grouped message. The lower communication overhead results in faster computation, as shown in Fig. 5.

The computations in Fuzzy C-means, label propagation, SLPA, Semi-Clustering, and R-Seeds do not involve finding the number of communities. Despite analyzing the structure of the network to find the number of communities, the proposed DeCom achieved the best time efficiency in comparison to the other algorithms; the degree-based initialization resulted in fewer iterations for DeCom. R-Seeds also performs better than the other state-of-the-art algorithms. As DeCom performs CPU-bound computations, a linear run time is observed in Fig. 5. For all the datasets, DeCom finishes the computation in fewer supersteps than the other algorithms, and it takes noticeable advantage of the felicitous selection of seed vertices over the randomly seeded variant R-Seeds.

The proposed DeCom is an order of magnitude more efficient in terms of run time than Semi-Clustering, FCM, LPA, and SLPA on the GooglePlus, LiveJournal, and Orkut datasets. On the Orkut dataset, FCM did not even complete its execution within 70 supersteps, as shown in Fig. 5(f). The phase-wise run time of DeCom is given in Fig. 6. As shown, most of the time is taken by Phase 2, the cluster formation phase; the first phase involves initialization and candidate seed selection, whereas the third phase performs the computations for final cluster seed selection, and both take less time than the second phase.


Fig. 7. Scalability: Number of workers vs Run-Time on (a) Facebook (b) Twitter (c) Amazon (d) GooglePlus (e) LiveJournal (f) Orkut

4.2. Scalability and Speedup

To analyze the scalability and speedup behavior of DeCom, we varied the number of available workers P from one to eight in the Giraph job. Fig. 7 shows the scalability of R-Seeds, DeCom, FCM, label propagation (LPA), SLPA, and the Semi-Clustering algorithm using up to eight nodes. As anticipated, increasing the number of processors decreases the run time. With added worker nodes, DeCom performs well in terms of run time: for 8 worker nodes, DeCom is two orders of magnitude faster than FCM and an order of magnitude faster than LPA and Semi-Clustering for all the datasets. DeCom is also more efficient than R-Seeds in terms of run time.

The scalability of the algorithm is measured in terms of speedup, which quantifies the efficiency of a parallel algorithm as the number of processors increases. It is calculated as:

$$Speedup = \frac{\text{Run time on 1 processor}}{P \times \text{Run time on } P \text{ processors}} \qquad (12)$$

where P is the number of processors used. For ideal parallelization, Speedup = 1. DeCom has a speedup of around 0.67 for the LiveJournal network and 0.64 for the GooglePlus network. Only eight processors were used here; if the number of processors were further increased, DeCom could achieve an even higher speedup. The speedup of DeCom is 21%, 9%, 24%, and 40% better than that of SLPA, Semi-Clustering, LPA, and FCM respectively.
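As a worked illustration of equation (12), which, with P in the denominator, is the normalized per-processor speedup (so the ideal value is 1), a hypothetical job taking 4,000 s on one worker and 746 s on eight workers gives the following; the timings are invented for illustration only.

def speedup(t1, tp, p):
    """Equation (12): run time on 1 processor divided by
    (P * run time on P processors); 1.0 is ideal scaling."""
    return t1 / (p * tp)

print(speedup(t1=4000.0, tp=746.0, p=8))   # ~0.67, cf. LiveJournal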


Fig. 8. Quality Assessment (a) Modularity: higher bar indicates better community detection (b) Conductance: lower bar indicates better community detection

Fig. 9. Speedup of Proposed DeCom


4.3. Community Quality Assessment

The efficiency of DeCom in finding overlapping communities is measured using modularity and conductance. DeCom optimizes the communities using modularity in the learning refinements phase. The modularity comparison of the proposed DeCom and the existing algorithms is shown in Fig. 8(a). The high modularity values of DeCom on all the considered networks indicate that it discovers overlapping communities more effectively than the state-of-the-art algorithms. As shown, DeCom is an order of magnitude more efficient in terms of modularity than the fuzzy C-means clustering algorithm, and the clusters produced by DeCom have on average 10% and 15% higher modularity than those produced by LPA and Semi-Clustering respectively.

Fig. 8(b) shows the performance of the DeCom model and the other state-of-the-art algorithms in terms of conductance, defined as the ratio of the number of inter-cluster edges to the minimum number of edges incident on either cluster $C_k$ or $C_{k'}$:

$$Conductance(C_k) = \frac{\sum_{i \in C_k,\, j \in C_{k'}} A_{ij}}{\min(A(C_k), A(C_{k'}))} \qquad (13)$$

where $A_{ij}$ is the adjacency matrix of graph G, $C_k \in V$, and $A(C_k) = \sum_{i \in C_k}\sum_{j \in V} A_{ij} - \sum_{i \in C_k}\sum_{j \in C_k} A_{ij}$ is the number of edges incident on $C_k$. The lower the conductance of a cluster, the fewer its inter-cluster edges.
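For illustration, the sketch below computes conductance using the common volume-based convention (cut edges divided by the smaller of the two clusters' total degrees), which matches the intent of equation (13); the set-based representation and function name are our own.

import numpy as np

def conductance(A, Ck):
    """Conductance of cluster Ck (cf. equation (13)): cut edges divided
    by the smaller total degree (volume) of Ck and its complement."""
    n = A.shape[0]
    inside = np.zeros(n, dtype=bool); inside[list(Ck)] = True
    cut = A[inside][:, ~inside].sum()   # inter-cluster edge weight
    vol_in = A[inside].sum()            # total degree of Ck
    vol_out = A[~inside].sum()          # total degree of the complement
    return cut / min(vol_in, vol_out)

# two triangles joined by a single edge
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
print(conductance(A, {0, 1, 2}))        # 1 cut edge / volume 7 ~ 0.14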

DeCom also performs well in terms of conductance in comparison to all the other algorithms for all the datasets: the clusters produced by DeCom are on average 21%, 13%, and 34% better in terms of conductance than those of FCM, LPA, and Semi-Clustering respectively. Thus, DeCom gains a noticeable advantage over the other competing algorithms.

4.4. Coverage

The performance of the proposed DeCom is also compared with the state-of-the-art clustering algorithms in terms of network coverage. Fig. 10 shows the returned coverage in percent. Graph coverage indicates the fraction of vertices allocated to clusters; for example, at 90% graph coverage, 10% of the vertices are not assigned to any cluster. In certain applications, the loss of vertices from the output may mean the loss of crucial information. FCM forms clusters by computing the mean of the distances among the vertices; hence, its graph coverage is 100%. LPA performs clustering by passing labels, and the vertices that do not receive any label are not added to any cluster. SLPA is an improved version of LPA, but its coverage is not 100% either. DeCom covers all the vertices that fall in the personalized PageRank walk, and it recovers the lost vertices by minimizing the reconstruction error, trying to also include in the final output those vertices that are not covered by personalized PageRank. Therefore, the coverage of the proposed DeCom is better than that of SLPA, LPA, and Semi-Clustering.


Fig. 10. Coverage of Clustering algorithms in %

5. Conclusion

In this paper, a deep neural network based model for detecting overlapping communities, named 'DeCom', is proposed. It is designed especially for handling large networks in an unsupervised manner, where the number of communities is not predefined. A layer-wise stacked autoencoder is used for finding seed nodes and for adding vertices to the communities based on the structure of the network. The Hadoop and Pregel frameworks are used for the scalable implementation of DeCom, which makes the whole learning process adaptable. Experimental results on various real-world network datasets have shown that DeCom performs well in terms of processing time when compared with the state-of-the-art algorithms Label Propagation, FCM, and Semi-Clustering. The scalability and coverage of the proposed model were also evaluated, and its accuracy was measured using conductance and modularity. DeCom finds clusters of varying sizes with high accuracy and achieves better clustering results. In future work, the model can be tested on graphs with billions of vertices using a larger number of machines with higher configurations, and it can be optimized to reduce the computational cost.

References

[1] Bai, X., Yang, P., Shi, X., 2017. An overlapping community detection algorithm based on density peaks. Neurocomputing 226, 7–15.
[2] Bedi, P., Sharma, C., 2016. Community detection in social networks. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 6 (3), 115–135.
[3] Bello-Orgaz, G., Hernandez-Castro, J., Camacho, D., 2017. Detecting discussion com-

munities on vaccination in Twitter. Future Generation Computer Systems 66, 125–136.
[4] Bengio, Y., et al., 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning 2 (1), 1–127.

[5] Bezdek, J. C., Ehrlich, R., Full, W., 1984. FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences 10 (2), 191–203.
[6] Borgatti, S. P., 2005. Centrality and network flow. Social Networks 27 (1), 55–71.
[7] Cao, S., Lu, W., Xu, Q., 2016. Deep neural networks for learning graph representations. In: AAAI. pp. 1145–1152.
[8] Dean, J., Ghemawat, S., 2008. MapReduce: Simplified data processing on large clusters. Communications of the ACM 51 (1), 107–113.
[9] Deng, L., Yu, D., Platt, J., 2012. Scalable stacking and learning for building deep architectures. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 2133–2136.
[10] Golsefid, S. M. M., Zarandi, M. H. F., Bastani, S., 2015. Fuzzy community detection model in social networks. International Journal of Intelligent Systems 30 (12), 1227–1244.
[11] Gregory, S., 2010. Finding overlapping communities in networks by label propagation. New Journal of Physics 12 (10), 103018.
[12] Hai-peng, C., Xuan-Jing, S., Ying-da, L., Jian-Wu, L., 2016. A novel automatic fuzzy clustering algorithm based on soft partition and membership information. Neurocomputing 236, 104–112.
[13] Havens, T. C., Bezdek, J. C., Leckie, C., Ramamohanarao, K., Palaniswami, M., 2013. A soft modularity function for detecting fuzzy communities in social networks. IEEE Transactions on Fuzzy Systems 21 (6), 1170–1175.
[14] Hu, P., Chan, K. C., He, T., 2017. Deep graph clustering in social network. In: Proceedings of the 26th International Conference on World Wide Web Companion. pp. 1425–1426.
[15] Hutchinson, B., Deng, L., Yu, D., 2013. Tensor deep stacking networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8), 1944–1957.
[16] Kajdanowicz, T., Indyk, W., Kazienko, P., Nov 2014. MapReduce approach to relational influence propagation in complex networks. Pattern Analysis and Applications 17 (4), 739–746. URL https://doi.org/10.1007/s10044-012-0294-6

[17] Kuzmin, K., Chen, M., Szymanski, B. K., 2015. Parallelizing SLPA for scalable overlapping community detection. Scientific Programming 2015, 4.
[18] LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature 521 (7553), 436–444.
[19] Leskovec, J., Krevl, A., June 2014. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data.
[20] Li, W., Jiang, S., Jin, Q., 2018. Overlap community detection using spectral algorithm based on node convergence degree. Future Generation Computer Systems 79, 408–416.
[21] Ludwig, S. A., 2015. MapReduce-based fuzzy c-means clustering algorithm: implementation and scalability. International Journal of Machine Learning and Cybernetics 6 (6), 923–934.
[22] Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., Czajkowski, G., 2010. Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. ACM, pp. 135–146.
[23] Martella, C., Shaposhnik, R., Logothetis, D., Harenberg, S., 2015. Practical graph analytics with Apache Giraph. Springer.
[24] Nepusz, T., Petróczi, A., Négyessy, L., Bazsó, F., 2008. Fuzzy communities and the concept of bridgeness in complex networks. Physical Review E 77 (1), 016107.
[25] Newman, M. E., 2006. Modularity and community structure in networks. Proceedings of the National Academy of Sciences 103 (23), 8577–8582.
[26] Newman, M. E., Girvan, M., 2004. Finding and evaluating community structure in networks. Physical Review E 69 (2), 1–15.
[27] Pal, N. R., Bezdek, J. C., 1995. On cluster validity for the fuzzy c-means model. IEEE Transactions on Fuzzy Systems 3 (3), 370–379.
[28] Raina, R., Madhavan, A., Ng, A. Y., 2009. Large-scale deep unsupervised learning using graphics processors. In: Proceedings of the 26th Annual International Conference on Machine Learning. ACM, pp. 873–880.
[29] Shao, M., Li, S., Ding, Z., Fu, Y., 2015. Deep linear coding for fast graph clustering. In: Proceedings of the 24th International Conference on Artificial Intelligence. AAAI Press, pp. 3798–3804.
[30] Šíma, J., 1996. Back-propagation is not efficient. Neural Networks 9 (6), 1017–1023.
[31] Song, C., Huang, Y., Liu, F., Wang, Z., Wang, L., 2014. Deep auto-encoder based clustering. Intelligent Data Analysis 18 (6S), S65–S76.

[32] Song, C., Liu, F., Huang, Y., Wang, L., Tan, T., 2013. Auto-encoder based data clustering. In: Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. Springer, pp. 117–124.
[33] Sun, K., Zhang, J., Zhang, C., Hu, J., 2016. Generalized extreme learning machine autoencoder and a new deep neural network. Neurocomputing 184, 232–242.
[34] Tian, F., Gao, B., Cui, Q., Chen, E., Liu, T.-Y., 2014. Learning deep representations for graph clustering. In: Proceedings of the 28th AAAI Conference on Artificial Intelligence. pp. 1293–1299.
[35] Wang, X., Liu, G., Pan, L., Li, J., 2016. Uncovering fuzzy communities in networks with structural similarity. Neurocomputing 210, 26–33.
[36] Wang, Y., Yao, H., Zhao, S., 2016. Auto-encoder based dimensionality reduction. Neurocomputing 230, 374–381.
[37] Wu, K.-L., 2012. Analysis of parameter selections for fuzzy c-means. Pattern Recognition 45 (1), 407–415.
[38] Xie, J., Girshick, R., Farhadi, A., 2016. Unsupervised deep embedding for clustering analysis. In: International Conference on Machine Learning. pp. 478–487.
[39] Xie, J., Kelley, S., Szymanski, B. K., 2013. Overlapping community detection in networks: The state-of-the-art and comparative study. ACM Computing Surveys (CSUR) 45 (4), 43.
[40] Xie, J., Szymanski, B. K., Liu, X., 2011. SLPA: Uncovering overlapping communities in social networks via a speaker-listener interaction dynamic process. In: Data Mining Workshops (ICDMW), 2011 IEEE 11th International Conference on. IEEE, pp. 344–349.
[41] Yoon, S.-H., Kim, K.-N., Hong, J., Kim, S.-W., Park, S., 2015. A community-based sampling method using DPL for online social networks. Information Sciences 306, 53–69.
[42] Zhang, J., Ge, S., 2012. A parallel algorithm to find overlapping community structure in directed and weighted complex networks. In: Second International Conference on Instrumentation, Measurement, Computer, Communication and Control (IMCCC). IEEE, pp. 1561–1564.
[43] Zhang, K., Chen, X.-W., 2014. Large-scale deep belief nets with MapReduce. IEEE Access 2, 395–403.


Author Biography

Vandana Bhatia is a Research Scholar in the Computer Science and Engineering Department at Thapar University, Patiala, India. Her research interests include big data analytics, graph mining, and soft computing. She received a Master of Technology in Computer Engineering from Kurukshetra University. She has contributed 6 research articles to journals and conferences. She is also a member of ACM and IEEE. Contact her at [email protected].

Dr. Rinkle Rani is working as an Associate Professor in the Computer Science and Engineering Department, Thapar University, Patiala. She completed her postgraduate studies at BITS, Pilani, and her Ph.D. at Punjabi University, Patiala. She has more than 20 years of teaching experience. She has supervised 3 Ph.D. and 43 M.Tech. dissertations and contributed 54 articles to conferences and 47 papers to research journals. Her areas of interest are computer networks and big data mining and analysis. She is a member of the professional bodies ACM, IEEE, ISTE, and CSI. She may be contacted at [email protected].

Highlights
1. Proposed a parallel overlapping community detection model leveraging autoencoder pipelines for large graphs.
2. It finds the number of communities by analyzing the structure of the graph.
3. DeCom scales up well to handle large graphs.
4. DeCom outperforms the competing algorithms in terms of quality and processing time.