GA-LP: A genetic algorithm based on Label Propagation to detect communities in directed networks


Accepted Manuscript

GA-LP: A genetic algorithm based on label propagation to detect communities in directed networks

Rodrigo Francisquini, Valerio Rosset, Mariá C.V. Nascimento

PII: S0957-4174(16)30691-1
DOI: 10.1016/j.eswa.2016.12.039
Reference: ESWA 11056

To appear in: Expert Systems With Applications

Received date: 13 May 2016
Revised date: 29 October 2016
Accepted date: 9 December 2016

Please cite this article as: Rodrigo Francisquini, Valerio Rosset, Mariá C.V. Nascimento, GA-LP: A genetic algorithm based on label propagation to detect communities in directed networks, Expert Systems With Applications (2017), doi: 10.1016/j.eswa.2016.12.039

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Highlights

• GA-LP presented results competitive with the current literature

• GA-LP refines the results of the well-known Label Propagation

• The time to detect communities in large directed networks by GA-LP is very low

GA-LP: A genetic algorithm based on label propagation to detect communities in directed networks

Rodrigo Francisquini (a,1), Valério Rosset (a,1), Mariá C. V. Nascimento (a,1)

(a) Instituto de Ciência e Tecnologia, Universidade Federal de São Paulo (UNIFESP), Av. Cesare M. G. Lattes, 1201, Eugênio de Mello, São José dos Campos-SP, CEP: 12247-014, Brasil

Abstract

Many real-world networks have a topological structure characterized by cohesive groups of vertices. To perform the task of identifying such subsets of vertices, community detection in networks has aroused the interest of researchers and practitioners alike. In spite of the existence of various efficient community detection algorithms in the literature, most of them use global information about the network and are therefore not applicable to distributed networks. This paper proposes a genetic-based algorithm to detect communities in directed networks that relies on local information to generate the offspring. The major difference between the proposed strategy and those found in the literature is the way it exploits target regions of interest in the solution space. This step is directly influenced by the crossover operator, which depends largely on the individual representation. In the introduced strategy, GA-LP, the individual is locally stored in the vertices as labels, which gives the system more flexibility to be adapted to applications that involve, for example, dynamic networks. In computational experiments, the proposed strategy showed an outstanding performance: it was fast and achieved the best results on average in the networks tested.

Keywords: label propagation; genetic algorithm; community detection problem

Email addresses: [email protected] (Rodrigo Francisquini), [email protected] (Valério Rosset)
URL: [email protected] (Mariá C. V. Nascimento)

Preprint submitted to Expert Systems with Applications

January 11, 2017

1. Introduction

A wide range of elements present in our daily life can be represented by means of a graph. Most of these real-world networks¹ have a non-trivial topology, which is the key reason for describing their vertices through the analysis of their structure. Accordingly, scholars have classified the study of these networks into three levels: microscopic, mesoscopic and macroscopic. The first regards individual properties of the vertices, such as the vertex degree, centrality and clustering tendency (Boccaletti et al., 2006). The second, which is the focus of this paper, refers to the community structure of the network, as a form of collectively investigating groups of vertices. The third level aims at the analysis of the network as a whole, by identifying its degree distribution and correlations, to mention a few (Strogatz, 2011).

There exist many tools to identify groups of densely connected vertices. Detecting communities (also known as clusters) in networks is a common way to draw inferences about the graph vertices. The problem takes its name from a sociological inspiration: social networks, which are composed of communities of individuals. Such networks, like the widely used social networks found on the internet, are also examples of complex networks.

Among the existing algorithms to perform the community detection task, we can highlight those that optimize assessment measures of vertex partitions (Newman and Girvan, 2004; Rosvall and Bergstrom, 2008; Malliaros and Vazirgiannis, 2013). Brandes et al. (2008) proved that a decision version of the modularity maximization problem is NP-complete. Therefore, heuristics are the most explored methods to solve this combinatorial problem. Even though a few studies attempt to find exact solutions for it, they are limited to networks with a few hundred vertices.

In the case of directed networks, some special traits must be observed for the analysis of the topological features of the network. For example, link reciprocity is a correlation metric that assesses the tendency of two vertices in the network to have mutual arcs between them (Garlaschelli and Loffredo, 2004). According to the authors, this assessment is of significant importance, since it controls the flow within a network and measures whether there is a balance of data propagation through the network. For example, in networks like the world wide web, if the link reciprocity of a group of vertices is relatively large, the flow of information shall be faster than if the link reciprocity is low.

¹ In this paper, the terms graphs and networks are used interchangeably.

Community detection in directed networks has been approached by only a few studies (Malliaros and Vazirgiannis, 2013). A key challenge in the network analysis is to define a consistent measure for evaluating the quality of the communities in these networks. According to Fortunato (2010), it is common to neglect the direction of the arcs when detecting communities in directed networks. This practice, however, may not be the most suitable since, still according to the author, it may lead to unexpected partitions. Developing strategies specially designed for directed networks is a hard task, due to the asymmetrical relations that lead to asymmetric matrices and more complex cases. In spite of that, more effort must be made to develop specialized algorithms. Some of the assessment metrics, such as the modularity in directed networks (Leicht and Newman, 2008), should be further investigated in order to tackle the community detection problem in such networks.

This paper introduces a genetic-based algorithm that performs the task of detecting communities in directed networks in a distributed scheme. It relies on the well-known Label Propagation (LP) algorithm (Raghavan et al., 2007), adapted to directed networks and refined with the introduction of genetic operators. The introduced strategy was named GA-LP. The computational experiments considered artificial LFR networks (Lancichinetti and Fortunato, 2009a) and real networks. This paper then presents a comparative analysis between the results of GA-LP and those of the state-of-the-art algorithms Infomap (Rosvall and Bergstrom, 2008), LP and the Order Statistics Local Optimization Method (OSLOM) (Lancichinetti et al., 2011). The results of the experiments indicate that GA-LP is robust and outperformed these strategies on average.

2. Related Works

This section presents a brief survey of the existing measures to define the community detection problem in directed networks. Additionally, it gives an overview of the main genetic-based strategies found in the literature to perform the task of graph clustering. Before going into detail about them, the next section presents the primary notations and definitions used in this paper.

2.1. Definitions and Notations

In this paper, G = (V, E) is a directed network, where V and E are its sets of vertices and arcs, respectively. The elements of V are represented by sequential numbers from 1 to the number of vertices, here denoted as n. An arc from E is a pair (i, j), where i is the tail of the arc and j is the head of the arc. The number of arcs in E is m.

Two vertices i and j are adjacent (neighbors) if either (i, j) or (j, i) belongs to E. The out-degree of a vertex i, denoted as $d_i^+$, is the number of arcs in which i is the tail. The number of arcs in which i is the head defines the in-degree of i, denoted by $d_i^-$. The vertices that are tails of arcs with vertex i as head are called the in-neighborhood of i. The vertices that are heads of arcs with vertex i as tail are the out-neighborhood of i.

This relational structure, the directed graph or network, represents a number of applications. For example, a set of flights may be represented through this combinatorial object, by considering the airports as the vertices of the network and the flights as the arcs between the airports. This representation is interesting because it makes it possible to draw inferences about the map of flights using graph mining strategies, such as community detection in networks.
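The notation above can be illustrated with a minimal sketch. The function name `build_digraph` and the toy arc list are assumptions made only for this example, and the sketch assumes a simple digraph (repeated arcs collapse because neighborhoods are stored as sets).

```python
from collections import defaultdict

def build_digraph(arcs):
    """Return (out_nbrs, in_nbrs) for a list of arcs given as (tail, head) pairs."""
    out_nbrs = defaultdict(set)  # out-neighborhood of each vertex
    in_nbrs = defaultdict(set)   # in-neighborhood of each vertex
    for tail, head in arcs:
        out_nbrs[tail].add(head)
        in_nbrs[head].add(tail)
    return out_nbrs, in_nbrs

# Toy digraph with n = 3 vertices and m = 4 arcs (illustrative data only).
vertices = [1, 2, 3]
arcs = [(1, 2), (2, 1), (1, 3), (3, 2)]
out_nbrs, in_nbrs = build_digraph(arcs)

out_degree = {v: len(out_nbrs[v]) for v in vertices}  # d_i^+
in_degree = {v: len(in_nbrs[v]) for v in vertices}    # d_i^-
print(out_degree, in_degree)
```

Vertices 1 and 2 are linked by arcs in both directions, whereas the pair (1, 3) is connected in one direction only, which is exactly the kind of asymmetry that the in- and out-neighborhoods keep apart.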

2.2. Community detection in directed networks

The lack of a precise definition of a community is responsible for the significantly high number of existing measures to evaluate a clustering in a network. Nevertheless, many of them focus on undirected networks. These definitions range from very simple studies about the relative connectivity inside and outside communities to more elaborate studies about the topology of the network. In general, the more sophisticated the measures are, the more reliable they are. However, very sophisticated measures usually require a large computational effort to calculate.

More specifically, Malliaros and Vazirgiannis (2013) define a community as a set of vertices with similar characteristics. In undirected networks, the density of the communities is therefore what relates the vertices inside a community as similar. Examples of measures that assess partitions taking this definition into account are the well-known measures suggested in (Radicchi et al., 2004; Girvan and Newman, 2002). Radicchi et al. (2004) score a community as either weak or strong depending on whether the number of edges between vertices inside communities is higher than the number of edges between communities. Besides being inaccurate, this measure always interprets a singleton as a strong community. Girvan and Newman (2002) introduced the widely employed modularity measure, which considers more than the number of edges inside communities to define the strength of communities. It also takes into account the number of edges expected in these communities under a null model, that is, in a graph with the same degree sequence as the one under consideration. This expected number can be calculated in a simplified form through the degrees of the vertices inside the communities, and the comparison assesses how strong a given set of communities is.

Some of these measures have no straightforward redefinition for evaluating the quality of communities in directed networks in such a way that meaningful graph-theoretic properties hold. Most of these properties regard the flow inside communities, which must be ensured. In this case, the absence of link symmetry, the flow inside communities and other factors related to the direction of the arcs in the communities should be taken into account. For example, Schaeffer (2007) claims that, to describe a community as connected, there must exist a directed path between any pair of vertices.

Therefore, as classified in (Malliaros and Vazirgiannis, 2013), another type of community detection algorithm is that based on the connectivity patterns that define a cluster. The map equation (Rosvall and Bergstrom, 2008), for example, belongs to this type since it is grounded on flow and information theory. The map equation captures the patterns of flow within the network, and the algorithm proposed by the authors based on this concept has presented interesting results, in particular for directed networks. This algorithm, Infomap, is well known in the literature for its results and its speed in detecting communities (Lancichinetti and Fortunato, 2009b). Additionally, the Order Statistics Local Optimization Method (OSLOM), presented in (Lancichinetti et al., 2011), relies on a fitness function that expresses the statistical significance of communities. A strategy that optimizes this measure was also proposed in (Lancichinetti et al., 2011).

Malliaros and Vazirgiannis (2013) point out the directed versions of modularity (Arenas et al., 2007; Leicht and Newman, 2008) as important measures to define objective functions of heuristic-based methodologies for detecting communities in directed networks. Arenas et al. (2007) generalized the modularity measure to assess the quality of communities of directed networks. To extend the measure, the authors observed that one must consider the density of arcs: the number of arcs pointing to and being pointed by vertices inside communities must be substantial in comparison to the expected number in a random network with the same in- and out-degree sequences. The formulation proposed in (Arenas et al., 2007; Leicht and Newman, 2008) for evaluating the modularity of a vertex partition π is presented in Equation (1).

$$q(\pi) = \frac{1}{m} \sum_{\forall C \in \pi} \sum_{i,j \in C} \left( a_{ji} - \frac{d_i^- d_j^+}{m} \right) \qquad (1)$$

where C is a community from π and $a_{ji}$ is the number of arcs with vertex i as head and j as tail.

To the best of our knowledge, only a few studies and algorithms specially designed for the modularity maximization problem in directed networks, which is also NP-complete (Brandes et al., 2008), can be found in the literature. Santos et al. (2016) optimized this measure in their consensus strategy, ConClus, and the results achieved were very competitive with the best results found in the literature. However, ConClus required a high computational time to obtain the final partitions.

2.3. Genetic Algorithms for the Community Detection Problem

Genetic Algorithms (GAs) are efficient search methods based on principles of natural selection and genetics (Goldberg, 1989). GAs rely on a population-based strategy, where individuals mate to produce the next generation of individuals. They are related to the Darwinian theory, in which the least fit individuals are less likely to survive the selection process. Consequently, the survivors are the fittest individuals, which reproduce and are expected to form better individuals. The mutation operator promotes the diversity of the population and is applied to a few individuals of the population.

GAs have achieved high-quality solutions for a wide range of combinatorial problems, in particular, the community detection problem. In this section, we review some solution methods developed to tackle this problem.

Concerning the modularity maximization problem, one may find the studies proposed in (Shang et al., 2013; Mu et al., 2015). They employ the modularity as the fitness function in their genetic algorithms, both hybridized with a simulated annealing strategy. Both studies consider the community detection problem in undirected networks. Additionally, their GAs encode the individuals in an n-dimensional vector, whose positions contain the label of the community to which the corresponding vertex belongs. On the one hand, the GA introduced in Shang et al. (2013) uses the number of communities in the final solution as prior knowledge and a two-way crossing as the crossover operator. On the other hand, Mu et al. (2015) suggested the one-point strategy, where only one community from one of the parents is replicated to the offspring, whereas the other communities remain the same as in the other parent. The experiments carried out by them employed LFR networks (Lancichinetti et al., 2008). Regarding those networks, Shang et al. (2013) considered networks with 4 balanced communities, each of them with 32 nodes, and with mixing parameter varying from 0.1 to 0.5. Mu et al. (2015) considered LFR networks with 1000 vertices and with mixing parameter from 0.1 to 0.5. The performance of both algorithms was competitive with other genetic-based algorithms.

CR IP T

Ma et al. (2014) introduced another memetic algorithm to tackle undirected networks. In this algorithm, the local search strategy hybridized with the GA belongs to the class of multi-level learning strategies. The encoding of the chromosomes is the same as in (Shang et al., 2013; Mu et al., 2015). The crossover

has high resemblance with that proposed in (Mu et al., 2015), a two-point strat-

egy. The computational experiments performed on LFR networks with 1000

AN US

vertices and mixing degree ranging from 0.1 to 0.7 show a very competitive performance of the introduced framework regarding the literature.

Other studies can be found in the literature, as in (Gong et al., 2011). Most of them employ the modularity as fitness function, have the same chromosome representation and differ from each other only slightly in the crossover operations. However, we have observed that all studies used networks with a few hundred vertices in their experiments. Additionally, they are mostly hybridized with a local search-based strategy, which is very time-consuming. Hence, they do not perform well on large-scale networks.

This paper investigates a genetic-based algorithm that can deal with large-scale networks to refine the well-known Label Propagation algorithm (Raghavan et al., 2007).

3. GA-based Label Propagation (GA-LP)

The prominent Label Propagation (LP) algorithm (Raghavan et al., 2007) aims at defining communities in an undirected network through a consensual decision among the labels of the vertices in the neighborhood of each vertex. This algorithm, which does not consider a fitness (objective) function to guide the strategy, has a linear asymptotic complexity. Consequently, LP is able to effectively detect communities in large-scale networks.

As LP originally identifies communities in undirected networks, the graph transformation approach is one way of using LP on digraphs without any adaptation (Malliaros and Vazirgiannis, 2013). This transformation can be very simple, such as disregarding the arc directions and treating the directed network as an undirected graph. However, this transformation has a major impact on the relation between the community structure and the communities detected by the tool. As a matter of fact, this transformation, also known as the naive transformation, is widely used in studies found in the literature, as pointed out in (Malliaros and Vazirgiannis, 2013), despite the obvious loss of information after this mapping. The information, in this case, is related to the community structure considering the notion of network flow inside communities, which depends on the asymmetric relation of the arcs.

The main ingredients of the GA proposed in this paper, the initial population and the crossover operator, were specially designed to have a low computational time. The key idea of the strategy is to rely on a neighborhood-based search, the label propagation algorithm (Raghavan et al., 2007), to suitably address large-scale networks. The next sections thoroughly discuss the details of the proposed GA, named GA-LP.

3.1. Chromosomes

This paper introduces a GA whose representation of the individuals can be encoded as a binary string of fixed length n(n − 1)/2. The binary string is the most commonly used type of encoding in the GAs found in the literature. Accordingly, each position k of the string indicates whether a pair of vertices belongs to the same community. GAs consider the building blocks scheme, in which the propagation of the “good” traits occurs taking into account blocks of the individuals, interpreted as their units.

However, as the introduced strategy is local-based, to avoid it being too memory-consuming, the representation of the individuals is stored in the vertices of the network. Then, for a given individual, each vertex holds its corresponding label, an integer value that ranges from 1 to n. Figure 1 displays an example of a set of four chromosomes.

Figure 1: This figure represents a set of four chromosomes in a network.

Figure 1 illustrates a network with 8 vertices, each of them named with a capital letter from A to H. The vector next to each node represents the set of labels of that vertex in the 4 individuals of the population. For example, in the first individual, indicated in the first position of the vector of each vertex, the labels of vertices A, B, C, D, E, F, G and H are, respectively, 1, 1, 1, 1, 2, 2, 2 and 2. These vectors are shown in matrix form on the right side of the figure. The 0-th column of this matrix, for example, presents the labels of the first individual of the population. Therefore, regarding the data structure, the storage of the individuals is linked to the vertices.
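A small sketch may help to visualize this vertex-centred storage. Only column 0 (the first individual) and the labels of E to H in column 2 come from the descriptions of Figures 1 and 2; every other value, as well as the helper names `chromosome` and `communities`, is invented for illustration.

```python
from collections import defaultdict

# Each vertex keeps a vector with its label in every individual of the population.
# Column 0 follows the text; labels of E-H in column 2 follow the crossover example;
# all remaining entries are made-up placeholders.
labels = {
    "A": [1, 1, 3, 5],
    "B": [1, 1, 3, 5],
    "C": [1, 2, 3, 5],
    "D": [1, 2, 3, 6],
    "E": [2, 2, 4, 6],
    "F": [2, 3, 4, 6],
    "G": [2, 3, 4, 7],
    "H": [2, 3, 4, 7],
}

def chromosome(labels, p):
    """Read individual p by collecting position p of every vertex vector."""
    return {v: vec[p] for v, vec in labels.items()}

def communities(chrom):
    """Group the vertices of one individual by their label."""
    groups = defaultdict(list)
    for v, lab in chrom.items():
        groups[lab].append(v)
    return dict(groups)

print(communities(chromosome(labels, 0)))
```

Reading one individual amounts to reading one column of the matrix, while each vertex only ever needs to hold its own row.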


3.2. Initial Population

The initial population of GA-LP is a set of solutions obtained by a simplified version of the LP algorithm. Unlike LP, which runs until it achieves a consensus among the vertex neighbors, each individual of the initial population in GA-LP is produced by an independent execution of a single iteration of LP, to ensure the diversity of the initial population. Besides that, an adaptation in this version was necessary because the original version of LP does not consider the arc directions. Algorithm 1 shows the introduced strategy to generate the initial population.

Algorithm 1: Direct Simple LP
Data: A connected digraph G
Result: The resulting partition, C
    C ← Propagate Label id (G);
    Mark all vertices;
    Refinement Phase(G, C);

Algorithm 2: Refinement Phase
Data: A connected digraph G, C
Result: The refined community
    Mark all vertices;
    while there exists a marked vertex do
        Randomly choose a marked vertex i;
        Pick at random one of the labels that repeat the most among the in-neighbors and assign such label to i;
        Unmark i;
    end

In Algorithm 1, Propagate Label id (G) sets as labels of the vertices their own id. It means that, at the beginning of the routine, the vertices are isolated communities. Then, a loop starts and at each iteration a vertex is chosen and its label is updated with the most common label within its in-neighborhood. As it is very likely that more than one label repeats the most within the in-neighborhood, the algorithm makes a random choice among them.

Preliminary tests led us to adopt the in-neighborhood in this heuristic. By fixing one type of neighborhood, we observed that the inner flow of the communities was more consistent. Moreover, the algorithm presents a high diversity in the resulting partitions that will form the initial population. The computational complexity of this algorithm is Θ(n).

As this strategy aims at detecting communities in large-scale networks, the initial population could not be large, as presented in the section of experiments.
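The single-pass procedure of Algorithms 1 and 2 can be sketched as follows. This is an illustrative reading rather than the authors' implementation: the function names, the `seed` parameter and the rule that a vertex without in-neighbors keeps its own label are assumptions of this sketch.

```python
import random
from collections import Counter

def direct_simple_lp(vertices, in_nbrs, rng):
    """One pass of directed label propagation over randomly ordered vertices."""
    labels = {v: v for v in vertices}  # every vertex starts as its own community
    order = list(vertices)
    rng.shuffle(order)                 # random choice of the next marked vertex
    for i in order:
        counts = Counter(labels[j] for j in in_nbrs.get(i, ()))
        if not counts:
            continue                   # assumption: no in-neighbors, keep own label
        best = max(counts.values())
        tied = sorted(lab for lab, c in counts.items() if c == best)
        labels[i] = rng.choice(tied)   # ties are broken at random
    return labels

def initial_population(vertices, in_nbrs, size, seed=0):
    """Independent single-pass runs, one per individual."""
    rng = random.Random(seed)
    return [direct_simple_lp(vertices, in_nbrs, rng) for _ in range(size)]

# Toy digraph: two directed triangles, arc 3 -> 4, and vertex 7 with out-arcs only.
vertices = [1, 2, 3, 4, 5, 6, 7]
in_nbrs = {1: [3, 7], 2: [1], 3: [2], 4: [6, 3], 5: [4], 6: [5]}
population = initial_population(vertices, in_nbrs, size=5)
print(population[0])
```

Because each run visits the vertices in a different random order and breaks ties at random, the individuals tend to differ, which is the source of diversity mentioned above.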


In GA-LP, the fitness function to classify the individuals of the population is the directed modularity.
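As an illustration, the directed modularity of Equation (1) can be computed in a single pass over the arcs. The sketch below uses the fact that the double sum over pairs inside a community factorizes into the product of the community's total in-degree and total out-degree; the function name and the toy network are assumptions made for this example.

```python
from collections import defaultdict

def directed_modularity(arcs, labels):
    """Directed modularity of the partition given by `labels` (vertex -> label)."""
    m = len(arcs)
    internal = defaultdict(int)  # arcs with both endpoints in the community
    in_sum = defaultdict(int)    # sum of in-degrees of the community members
    out_sum = defaultdict(int)   # sum of out-degrees of the community members
    for tail, head in arcs:
        out_sum[labels[tail]] += 1
        in_sum[labels[head]] += 1
        if labels[tail] == labels[head]:
            internal[labels[tail]] += 1
    return sum(internal[c] / m - in_sum[c] * out_sum[c] / m ** 2
               for c in set(labels.values()))

# Two directed triangles joined by the arc 3 -> 4 (illustrative data only).
arcs = [(1, 2), (2, 3), (3, 1), (4, 5), (5, 6), (6, 4), (3, 4)]
two_groups = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}
one_group = {v: "a" for v in range(1, 7)}
print(directed_modularity(arcs, two_groups))  # 18/49, about 0.367
print(directed_modularity(arcs, one_group))   # 0.0
```

Placing every vertex in one community yields zero, since the observed and expected numbers of internal arcs coincide, whereas the two-triangle partition scores 18/49.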

3.3. Selection Algorithm

GA-LP uses roulette wheel selection, a fitness proportionate selection method widely employed in GA-based strategies. It consists of randomly selecting the individuals for the recombination process, with a probability of selection proportional to the fitness value of the individual.

3.4. Genetic Operators

The crossover operator proposed here follows a local-based strategy to propagate the labels of the chosen pair of parents. Algorithm 3 presents its main steps.

Algorithm 3: Crossover
Data: A connected digraph G, a pair of communities (parents), C1 and C2
Result: A labeled digraph G, corresponding to the produced chromosome
    k ← 1;
    Unmark all vertices of G;
    repeat
        Randomly choose an unmarked vertex i;
        Pick at random one of the parents and assign it to C;
        Store in variable l the label of vertex i in C;
        Mark all vertices as unvisited;
        Propagate-dfs(G, l, k, i, C);
        k ← k + 1;
    until there is no unmarked vertex;

In Algorithm 3, the marks on the vertices refer to the labels of the chromosome resulting from the crossover operation. They are assigned according to the variable k, which is incremented each time a community has been defined according to one of the parents. For this, until all vertices have been marked, the GA-LP algorithm randomly chooses an unmarked vertex i. Then, it randomly picks one of the two selected parents, here referred to as C. The label of vertex i in parent C is assigned to variable l. Then, a second mark, which controls whether or not a vertex was visited by the propagation function, is updated. At every new call of the corresponding function, the algorithm marks all vertices as unvisited, to ensure that they and their neighbours are checked every time a new label is presented to the set of vertices. Then, the propagation of label l happens accordingly, and when it finishes the variable k is incremented.

The function Propagate-dfs(G, l, k, i, C) works as described in Algorithm 4.

The distribution of labels among the vertices is based on the depth-first search (DFS). The input of this algorithm is a starting vertex i and the label to propagate according to a parent C. First, the starting vertex receives the label k, interpreted as the label l of the parent C. This relabeling avoids redundant labels through the propagation process, since labels can be the same in the different parents but with different sets of members. When all unmarked vertices with label l in the corresponding parent have been updated with label k, the strategy halts. It is worth mentioning that once a vertex has been marked, it does not change its label even if it is a member of communities propagated after the one that first marked it.
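The crossover and its DFS-based propagation can be sketched as below. This follows the prose description rather than any released code: the DFS is written iteratively with an explicit stack, `nbrs` stands for whichever neighborhood guides the propagation, and all function names and the toy data are assumptions of this sketch.

```python
import random

def propagate_dfs(nbrs, parent, l, k, start, child):
    """Give mark k to unmarked vertices reachable from `start` that have label l in `parent`."""
    child[start] = k
    visited = {start}          # fresh visited set at every call
    stack = [start]
    while stack:
        v = stack.pop()
        for u in nbrs.get(v, ()):
            if u in visited:
                continue
            visited.add(u)
            if parent[u] == l and u not in child:  # already marked vertices keep their mark
                child[u] = k
                stack.append(u)

def crossover(vertices, nbrs, parent1, parent2, rng):
    """Build one child by repeatedly copying a community from a randomly chosen parent."""
    child = {}
    k = 1
    while True:
        unmarked = [v for v in vertices if v not in child]
        if not unmarked:
            return child
        i = rng.choice(unmarked)
        parent = rng.choice((parent1, parent2))
        propagate_dfs(nbrs, parent, parent[i], k, i, child)
        k += 1

# Toy example: two directed 4-cycles (given by in-neighbors) linked through D -> E.
vertices = list("ABCDEFGH")
nbrs = {"A": ["D"], "B": ["A"], "C": ["B"], "D": ["C"],
        "E": ["H", "D"], "F": ["E"], "G": ["F"], "H": ["G"]}
parent = {v: 1 if v in "ABCD" else 2 for v in vertices}
child = crossover(vertices, nbrs, parent, parent, random.Random(1))
print(child)
```

When both parents agree, as in this usage example, the child reproduces their partition up to a renaming of the labels, which is a convenient sanity check for the operator.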

PT

Figure 2 presents an example of the crossover between parents of population exemplified in Figure 1. The crossover between the parental chromosomes in

CE

positions 0 and 2 of the vectors starts at vertex A. The recombination process randomly selects parent 0 and propagates label 1 according to the members with this label in this parent. Then, besides A, vertices B, C and D receive label 1.

AC

Then, the recombination continues and vertex E is randomly selected among those unmarked vertices (E, F , G and H) as well as parent 2. Then, label 2,

representing label 4, is propagated to those vertices which are members of such community in parent 2 that have not been unmarked yet. The algorithm then halts, since E, F , G and H have label 4 in parent 2 and no unmarked vertex remains. 14

Figure 2: An illustration of the crossover process of individuals located in positions 0 and 2 of the matrix of the current generation. (a) The choice for the parents. (b) Label Propagation of Parent 0. (c) New random choice, Parent 2 was picked. (d) Label Propagation of Parent 2.

Algorithm 4: Propagate-dfs
Data: A connected digraph G, a label l, k and the initial vertex i, the replicated parent C
Result: A connected digraph G with the marks on its vertices and a vertex i
    Mark vertex i with the value of k;
    forall the neighbor ni of vertex i do
        if ni is unvisited then
            Update ni as visited;
            if label of ni in C is l and ni is unmarked then
                Assign to ni as mark label l;
                Propagate-dfs(G, l, k, ni, C);
            end
        end
    end

Like most of the genetic algorithms designed to solve the community detection problem, the introduced strategy also presents a refinement step. Even though it is not a local search, because it does not explicitly optimize the fitness function during the process, empirical experiments indicated a significant improvement in the quality of the final solution after incorporating this stage for each produced chromosome. In line with this, Algorithm 2 was applied to every solution found in the crossover process.

3.5. Substitution Process

The substitution process follows the elitism paradigm: up to the 60% strongest chromosomes of the new generation replace the 60% weakest chromosomes of the current generation. The exact substitution percentage depends on whether some of the 60% strongest new individuals have a lower fitness function value than any of the 60% weakest chromosomes from the current generation. If so, the fittest individuals are kept for the next generation.
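One possible reading of this rule is sketched below; it is an interpretation, not the authors' code. The strongest 40% of the current generation always survive, and the remaining slots are filled with the fittest among the 60% weakest current chromosomes and the 60% strongest new ones. The function name `substitute`, the rounding of the 60% share and the tie handling (current chromosomes win ties) are assumptions.

```python
def substitute(current, offspring, fitness, rate=0.6):
    """Elitist replacement of up to `rate` of the current generation."""
    n = len(current)
    r = int(round(rate * n))                        # number of replaceable slots
    ranked_cur = sorted(current, key=fitness, reverse=True)
    ranked_off = sorted(offspring, key=fitness, reverse=True)
    survivors = ranked_cur[:n - r]                  # strongest current chromosomes stay
    pool = ranked_cur[n - r:] + ranked_off[:r]      # weakest current vs strongest new
    pool.sort(key=fitness, reverse=True)            # stable sort: ties favor current ones
    return survivors + pool[:r]

# Usage with numbers standing in for chromosomes and identity as fitness.
current = [5, 4, 3, 2, 1]
offspring = [10, 9, 0.5, 0.2, 0.1]
next_generation = substitute(current, offspring, fitness=lambda x: x)
print(next_generation)  # [5, 4, 10, 9, 3]
```

In the usage example only two of the three candidate offspring enter the next generation, because the third one is weaker than a current chromosome it would otherwise replace.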

Figure 3 presents a flowchart of GA-LP. The stopping criterion employed in this algorithm is either the maximum number of generations or the maximum number of generations without any improvement. The next section presents the values for these parameters.

Figure 3: Flowchart of the GA-LP algorithm.

4. Computational Experiments

This section presents a set of two computational experiments carried out to attest the quality of GA-LP in comparison to the best algorithms in the literature. Before going into detail about them, we present a study about the configuration of the parameters of the introduced strategy and the datasets used to fine-tune the parameters.

4.1. Artificial Networks

A pool of artificial networks, known as LFR networks, was generated using the software introduced in (Lancichinetti and Fortunato, 2009a). As the GA-LP algorithm aims at solving the community detection problem in large-scale networks, besides generating directed graphs with the most usual numbers of vertices in this type of experiment, between 1000 and 5000 vertices, we also considered networks with 10000 and 50000 vertices. Table 1 summarizes the primary parameters employed to produce this pool of LFR networks.

Table 1: Parameters employed to generate the artificial networks with the software introduced in (Lancichinetti and Fortunato, 2009a).

Parameter                  Values
n                          {1000, 5000, 10000, 50000}
µ (mixture parameter)      {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8}
d̄G                         20
max dG                     50
neg. exp. dG               2
neg. exp. |C|              1
min/max community size     10/50
min/max community size     20/100

The created set of artificial LFR networks has 5 directed networks for every combination of the parameter values presented in Table 1. The networks whose minimum/maximum community size is 10/50 belong to a class named S, whereas those whose minimum/maximum community size is 20/100 compose the class L. These parameter values were suggested in (Santos et al., 2016), which presented a recent study on community detection in directed networks.
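The benchmark pool described above can be enumerated with a short script. This is a sketch under the assumption that "every combination" means every pair of n and µ within each class; the dictionary keys are illustrative names, not options of the LFR generator.

```python
from itertools import product

SIZES = [1000, 5000, 10000, 50000]
MIXTURES = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
CLASSES = {"S": (10, 50), "L": (20, 100)}   # min/max community size
REPLICATES = 5                               # networks per combination


def lfr_configurations():
    """Yield one (hypothetical) configuration record per benchmark network."""
    for n, mu, (name, (cmin, cmax)), rep in product(
        SIZES, MIXTURES, CLASSES.items(), range(REPLICATES)
    ):
        yield {
            "n": n,
            "mu": mu,
            "class": name,
            "min_community": cmin,
            "max_community": cmax,
            "avg_degree": 20,        # d̄G in Table 1
            "max_degree": 50,        # max dG in Table 1
            "degree_exponent": 2,    # neg. exp. dG
            "size_exponent": 1,      # neg. exp. |C|
            "replicate": rep,
        }


configs = list(lfr_configurations())
print(len(configs))  # 4 sizes x 8 mixtures x 2 classes x 5 replicates
```

Under this reading the pool contains 320 networks, 40 of which belong to class S with 1000 vertices.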

4.2. Fine tuning of the GA-LP parameters

The experiment carried out for fine tuning aims at defining the following parameters of GA-LP:

• the type of neighborhood to be considered to generate the initial population in the directed LP: in-neighbors, out-neighbors or all neighbors;

• the type of neighborhood used in the propagation to guide the crossover in GA-LP: in-neighbors, out-neighbors or all neighbors;

• the size of the population: 5, 10, 15 or 30;

• the maximum number of generations without improvement: 1, 3, 5 or 10.

The closeness to the expected partitions according to their Normalized Mutual Information (NMI) (Danon et al., 2005) is the metric used to evaluate the quality of the resulting partitions. All networks from the set S with 1000 vertices were used in this experiment.

The results of this experiment, combining all parameter values, indicated that the best configuration was:

• the type of neighborhood to be considered to generate the initial population in the directed LP: in-neighbors;

• the type of neighborhood used in the propagation to guide the crossover in GA-LP: in-neighbors;

• the size of the population: 5;

• the maximum number of generations without improvement: 5.

Besides considering the NMI, we evaluated the computational times. There were some slight differences with regard to the different neighborhoods, but the best overall results were achieved with this set of parameters. These values were used in both experiments presented next.
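The full-factorial tuning described in this section can be sketched as a grid search. The key names and the `mean_nmi` scoring callable are assumptions for illustration; the actual evaluation would run GA-LP on the class S networks with 1000 vertices.

```python
from itertools import product

NEIGHBORHOODS = ["in", "out", "all"]
POPULATION_SIZES = [5, 10, 15, 30]
MAX_STALL = [1, 3, 5, 10]   # generations without improvement


def tuning_grid():
    """Enumerate every combination of the four tuned parameters."""
    return [
        {
            "init_neighborhood": init,
            "crossover_neighborhood": cross,
            "population_size": size,
            "max_stall": stall,
        }
        for init, cross, size, stall in product(
            NEIGHBORHOODS, NEIGHBORHOODS, POPULATION_SIZES, MAX_STALL
        )
    ]


def best_configuration(grid, mean_nmi):
    """Return the configuration with the highest score.

    `mean_nmi` is a hypothetical callable mapping a configuration to the
    average NMI obtained with it on the tuning networks.
    """
    return max(grid, key=mean_nmi)


grid = tuning_grid()
print(len(grid))  # 3 x 3 x 4 x 4 combinations
```

The grid has 144 configurations, and the one selected in the text (in-neighbors for both phases, population size 5, at most 5 generations without improvement) is one of them.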

5. Experiment I

The first experiment considered the LFR networks discussed in Section 4.1. To allow a comparative analysis, the results achieved by GA-LP are presented together with those achieved by LP, Infomap and OSLOM. These community detection algorithms were used in this experiment because they can deal with large-scale networks and have versions for directed networks.

Concerning the NMI values, the closer to 1, the better they are. Figures 4 to 11 display the average results (NMI) of the algorithms of this experiment.
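For reference, the NMI between a detected partition and the expected one can be computed as twice their mutual information divided by the sum of their entropies, which is the normalization of Danon et al. (2005). The sketch below is a minimal pure-Python version; the function name and the label-list input format are choices made for this example.

```python
from collections import Counter
from math import log


def nmi(labels_a, labels_b):
    """Normalized Mutual Information between two partitions.

    Each partition is given as a list of community labels, one per vertex,
    with vertices listed in the same order in both lists.
    """
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("partitions must be non-empty and of equal length")
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mutual = sum(
        nij / n * log(nij * n / (count_a[a] * count_b[b]))
        for (a, b), nij in joint.items()
    )
    entropy_a = -sum(c / n * log(c / n) for c in count_a.values())
    entropy_b = -sum(c / n * log(c / n) for c in count_b.values())
    if entropy_a + entropy_b == 0:
        return 1.0  # both partitions consist of a single community
    return 2 * mutual / (entropy_a + entropy_b)


same = nmi([0, 0, 1, 1], [1, 1, 0, 0])         # same partition, relabeled
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # statistically independent
```

Identical partitions score 1 regardless of how the labels are named, and independent partitions score 0.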

[Figures 4 to 11 each show two panels, Mean NMI and Mean Time (sec), plotted against µt for GA-LP, LP, Infomap and OSLOM.]

Figure 4: Average NMI and times of networks with 1000 vertices and small-sized communities.

Figure 5: Average NMI and times of networks with 1000 vertices and large-sized communities.

Figure 6: Average NMI and times of networks with 5000 vertices and small-sized communities.

Figure 7: Average NMI and times of networks with 5000 vertices and large-sized communities.

Figure 8: Average NMI and times of networks with 10000 vertices and small-sized communities.

Figure 9: Average NMI and times of networks with 10000 vertices and large-sized communities.

Figure 10: Average NMI and times of networks with 50000 vertices and small-sized communities.

Figure 11: Average NMI and times of networks with 50000 vertices and large-sized communities.

Furthermore, these figures show the mean times required by each algorithm to return its solution.

The first goal of our strategy, a local strategy that refines the results of LP, was successfully achieved: in all case studies, GA-LP outperformed LP. However, considering the state-of-the-art algorithms Infomap and OSLOM, the comparative analysis shows that OSLOM was the most robust algorithm and achieved the best results in the majority of the networks tested. Infomap had difficulties in defining partitions for the class L of networks with mixture parameter 0.8. Apart from these networks, Infomap showed a very good performance, with high NMI values. On average, GA-LP was very competitive with these two strategies even though it did not outperform them.

On the one hand, both OSLOM and Infomap presented very high computational times. On the other hand, GA-LP had average times close to those required by LP, both very low compared with the other algorithms. These results corroborate that GA-LP is an outstanding strategy for detecting communities in large-scale networks.

In complementary experiments, we tested GA-LP with even larger networks. Both Infomap and OSLOM were too slow in detecting communities in such networks, which made it very hard to obtain results on the machine used to run the experiments. GA-LP had NMI values very close to those achieved with the networks with 50000 vertices.

As observed in Santos et al. (2016), OSLOM presents a poor performance in detecting communities in dense networks, i.e., networks with a high number of arcs. To assess the results achieved by GA-LP in dense networks, the dense networks generated by Santos et al. (2016) were used. These networks have the same configurations as the networks with 1000 and 5000 vertices from the set L. However, to ensure denser communities, the average and maximum in-degree were set to 40 and 100, respectively. Figures 12 and 13 display the average results of GA-LP, LP, Infomap and OSLOM for detecting communities in dense networks.

Figure 12: Average NMI and times of dense networks with 1000 vertices and large-sized communities.

Figure 13: Average NMI and times of dense networks with 5000 vertices and large-sized communities.

The results achieved by GA-LP were very stable, outperforming the other strategies on average in all tested networks. The times spent by GA-LP and LP were very similar and significantly lower than those of the other strategies. According to this experiment, GA-LP is a highly recommended strategy to detect communities in large networks. The next section presents an experiment carried out using two real networks.

6. Experiment II

This experiment considers two real networks, one undirected and the other directed. The undirected network is the well-known social network afootball (Girvan and Newman, 2002). The network afootball is a collection of data regarding matches of Division I-A American college football teams played in fall 2000. The 115 vertices of this network represent the football teams, and two vertices are linked by an edge if the corresponding teams played against each other in a match in the season. The number of edges in afootball is 613. The expected partition in this network corresponds to the groups of the 12 conferences of the season. The members of these groups are highly related. Figure 14 displays the network, plotted using the visualization function from the package igraph for R, considering the layout proposed in (Kamada and Kawai, 1988).

The NMI of the partition found by GA-LP was 0.91151. LP, Infomap and OSLOM obtained communities whose NMI values were, respectively, 0.91085, 0.92419 and 0.91568. This result shows that GA-LP is still efficient in finding communities in undirected networks.

Figure 14: The undirected network afootball labeled according to the communities obtained by GA-LP.

Regarding the real directed network, we tested GA-LP in the network polBlogs (Adamic and Glance, 2005). This network is a hyperlink representation of 1490 blogs that discuss US politics. The vertices correspond to the sites of the blogs and an arc between a pair of vertices indicates a hyperlink between the two blogs. The number of arcs in this network is 18910. The expected partition regards the classification of the blogs into liberal and conservative. In this network, 266 blogs have no hyperlinks into or out of them; therefore, to detect communities, these vertices were removed. Figure 15 displays the network, labeled according to the communities detected by GA-LP.

Figure 15: The directed network polBlogs, with the arcs contracted according to the groups found by GA-LP.

The NMI of the communities found by GA-LP was 0.6889. OSLOM, Infomap and LP obtained partitions whose NMI values were, respectively, 0.57208, 0.43547 and 0.38534. This result demonstrates that GA-LP provided a partition closer to the expected communities.
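The preprocessing applied to polBlogs, namely dropping the blogs with no incoming or outgoing hyperlinks, can be sketched as below. The function name and the toy graph are illustrative assumptions; with the counts reported above, 1490 - 266 = 1224 vertices would remain.

```python
def drop_isolated(vertices, arcs):
    """Keep only the vertices that appear in at least one arc.

    `arcs` is a list of (source, target) pairs. No arc is lost, because
    removed vertices have no incoming or outgoing arcs by definition.
    """
    touched = set()
    for source, target in arcs:
        touched.add(source)
        touched.add(target)
    kept = [v for v in vertices if v in touched]
    return kept, list(arcs)


# Toy usage: vertex "d" has no hyperlinks and is removed.
vertices = ["a", "b", "c", "d"]
arcs = [("a", "b"), ("b", "a"), ("c", "a")]
kept, kept_arcs = drop_isolated(vertices, arcs)
```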

7. Conclusions

This paper proposed a local-based genetic algorithm strategy to detect communities in directed large-scale networks. It relies on the label propagation (LP) algorithm, well known for being an outstanding strategy to accomplish the task of identifying community structures in large-scale networks. Nevertheless, because in LP the propagation of labels is based on the most frequent labels within the neighborhood, it is very challenging for LP to detect communities in dense networks. For these networks, it is very common to obtain singletons or very large communities, which makes a loss of information more likely and results in the detection of poor community structures.

To both overcome the challenge of detecting communities in dense networks and address directed large-scale networks, this paper presents the Genetic Algorithm based on Label Propagation (GA-LP). Its main trait is the local nature of the key operators of genetic algorithms, which makes it possible to address very large networks within a reasonable computational time. The fitness function used in GA-LP was the modularity, a measure not limited to the 1-neighborhood of the vertices. However, in distributed applications, local assessment metrics can be used, or even estimated by the consensus between the different partitions.
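As an illustration of the fitness measure, the sketch below computes the modularity of a partition of a directed, unweighted network. It assumes the directed formulation of Leicht and Newman (2008), Q = (1/m) Σ_ij [A_ij − k_i^out k_j^in / m] δ(c_i, c_j); this section only names "the modularity", so the exact variant used by GA-LP is an assumption here, as are the function and variable names.

```python
from collections import defaultdict


def directed_modularity(arcs, community):
    """Directed modularity of a partition (assumed Leicht-Newman form).

    `arcs` is a list of (source, target) pairs and `community` maps each
    vertex to its community label. The sum over vertex pairs is grouped by
    community: Q = sum_c [ m_c / m - (K_c_out * K_c_in) / m**2 ].
    """
    m = len(arcs)
    if m == 0:
        raise ValueError("the network must have at least one arc")
    inside = defaultdict(int)     # arcs with both endpoints in community c
    out_total = defaultdict(int)  # sum of out-degrees of vertices in c
    in_total = defaultdict(int)   # sum of in-degrees of vertices in c
    for source, target in arcs:
        c_source, c_target = community[source], community[target]
        out_total[c_source] += 1
        in_total[c_target] += 1
        if c_source == c_target:
            inside[c_source] += 1
    labels = set(out_total) | set(in_total)
    return sum(
        inside[c] / m - out_total[c] * in_total[c] / m**2 for c in labels
    )


# Toy usage: two disconnected directed triangles.
arcs = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)]
split = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
merged = {v: "A" for v in range(6)}
q_split = directed_modularity(arcs, split)
q_merged = directed_modularity(arcs, merged)
```

Separating the two triangles gives a modularity of 0.5, whereas placing all vertices in one community gives 0.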

In computational experiments with LFR networks, GA-LP was more robust than LP, requiring approximately the same computational time as LP. Moreover, GA-LP outperformed the benchmark algorithm OSLOM in detecting communities in dense and large networks. In these networks, it was competitive with Infomap but required a significantly smaller computational time than this strategy.

Considering networks with low density and a few hundred vertices, GA-LP outperformed only LP, being very competitive with the other strategies. Thus, GA-LP did not outperform Infomap and OSLOM, which were significantly slower than it but still found their results within 150 seconds.

The results point to an interesting behavior of GA-LP. A stability analysis of the method that initializes the population in GA-LP, based on independent trials, has shown (in Appendix A) that this phase is very stable even for networks with a high mixture parameter. Therefore, being very robust, GA-LP can better address applications with fuzzy communities, on which the existing genetic algorithms and other well-known community detection algorithms usually fail.

As future work, the authors intend to develop a distributed version of this strategy to address an application of routing in wireless sensor networks. In this case, the authors aim to employ the local version of the modularity measure to accelerate the calculation of the fitness of a community. Moreover, as the application requires more numbered communities, a size controlling mechanism will be used in this new version of GA-LP.

Acknowledgments

The authors of this paper are grateful to Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP) (Grant Numbers: 2015/21660-4 and 2015/18580-9) and Conselho Nacional de Desenvolvimento em Pesquisa (CNPq) (Grant Numbers: 448614/2014-6 and 308708/2015-6) for their research funding. The authors also gratefully acknowledge the anonymous referees for their constructive comments that improved this paper.

References

Adamic, L. A. and Glance, N. (2005). The political blogosphere and the 2004 U.S. election: Divided they blog. In Proceedings of the 3rd International Workshop on Link Discovery, pages 36–43, New York, NY, USA. ACM.

Arenas, A., Duch, J., Fernández, A., and Gómez, S. (2007). Size reduction of complex networks preserving modularity. New Journal of Physics, 9:176.

Boccaletti, S., Latora, V., Moreno, Y., Chavez, M., and Hwang, D.-U. (2006). Complex networks: structure and dynamics. Physics Reports, 424:175–308.

Brandes, U., Delling, D., Gaertler, M., Gorke, R., Hoefer, M., Nikoloski, Z., and Wagner, D. (2008). On modularity clustering. Knowledge and Data Engineering, IEEE Transactions on, 20(2):172–188.

Danon, L., Díaz-Guilera, A., Duch, J., and Arenas, A. (2005). Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008.

Fortunato, S. (2010). Community detection in graphs. Physics Reports, 486:75–174.

Garlaschelli, D. and Loffredo, M. I. (2004). Patterns of link reciprocity in directed networks. Phys. Rev. Lett., 93:268701.

Girvan, M. and Newman, M. (2002). Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12):7821–7826.

Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1st edition.

Gong, M., Fu, B., Jiao, L., and Du, H. (2011). Memetic algorithm for community detection in networks. Phys. Rev. E, 84:2011.

Kamada, T. and Kawai, S. (1988). An algorithm for drawing general undirected graphs. Information Processing Letters, 31:7–15.

Lancichinetti, A. and Fortunato, S. (2009a). Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities. Physical Review E, page 016118.

Lancichinetti, A. and Fortunato, S. (2009b). Community detection algorithms: a comparative analysis. Physical Review E, 80:056117.

Lancichinetti, A., Fortunato, S., and Radicchi, F. (2008). Benchmark graphs for testing community detection algorithms. Physical Review E, 78:046110.

Lancichinetti, A., Radicchi, F., Ramasco, J. J., Fortunato, S., et al. (2011). Finding statistically significant communities in networks. PloS One, 6(4):e18961.

Leicht, E. A. and Newman, M. E. J. (2008). Community structure in directed networks. Physical Review Letters, 100:118703.

Ma, L., Gong, M., Liu, J., Cai, Q., and Jiao, L. (2014). Multi-level learning based memetic algorithm for community detection. Applied Soft Computing, 19:121–133.

Malliaros, F. D. and Vazirgiannis, M. (2013). Clustering and community detection in directed networks: A survey. Physics Reports, 533:95–142.

Mu, C.-H., Xie, J., Liu, Y., Chen, F., Liu, Y., and Jiao, L.-C. (2015). Memetic algorithm with simulated annealing strategy and tightness greedy optimization for community detection in networks. Applied Soft Computing, 34:485–501.

Newman, M. E. J. and Girvan, M. (2004). Finding and evaluating community structure in networks. American Physical Society, 69(2):1–15.

Radicchi, F., Castellano, C., Cecconi, F., Loreto, V., and Parisi, D. (2004). Defining and identifying communities in networks. Proceedings of the National Academy of Sciences of the United States of America, 101:2658–2663.

Raghavan, U. N., Albert, R., and Kumara, S. (2007). Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106.

Rosvall, M. and Bergstrom, C. T. (2008). Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105:1118–1123.

Santos, C. P., Carvalho, D. M., and Nascimento, M. C. V. (2016). A consensus graph clustering algorithm for directed networks. Expert Systems with Applications.

Schaeffer, S. E. (2007). Graph clustering. Computer Science Review, 1:27–64.

Shang, R., Bai, J., Jiao, L., and Jin, C. (2013). Community detection based on modularity and an improved genetic algorithm. Physica A: Statistical Mechanics and its Applications, 392(5):1215–1231.

Strogatz, S. H. (2011). Exploring complex networks. Nature, 410:268–276.

Appendix A. Analysis of Initial Population

This appendix presents an analysis of initial populations generated in independent trials. The aim of this analysis is to attest the robustness of GA-LP, since it has a random component for producing initial solutions.

Table A.2: Average fitness (modularity) values of ten initial populations (P1 to P10) of a network with 50000 vertices and small-sized clusters, randomly selected from those with the corresponding mixture parameter (µt).

µt          0.1       0.2       0.3       0.4       0.5       0.6       0.7       0.8
P1          0.157149  0.111733  0.071028  0.045560  0.029229  0.019743  0.013325  0.009950
P2          0.156171  0.111726  0.071330  0.045096  0.029281  0.019740  0.013569  0.009767
P3          0.156900  0.110964  0.071441  0.045210  0.029108  0.019552  0.013321  0.009923
P4          0.156571  0.111585  0.071119  0.045155  0.029437  0.019491  0.013342  0.009861
P5          0.156877  0.110971  0.071092  0.045531  0.029239  0.019483  0.013264  0.009714
P6          0.156470  0.111597  0.071666  0.045216  0.029189  0.019379  0.013445  0.009962
P7          0.157045  0.112128  0.071457  0.045243  0.029078  0.019596  0.013211  0.009749
P8          0.156960  0.111416  0.071208  0.045379  0.029381  0.019516  0.013423  0.009997
P9          0.156791  0.111449  0.071394  0.045633  0.029288  0.019425  0.013276  0.009843
P10         0.157143  0.111884  0.071579  0.045267  0.029285  0.019497  0.013445  0.009853
Average     0.156808  0.111545  0.071331  0.045329  0.029252  0.019542  0.013362  0.009862
Std. Dev.   3.16E-04  3.68E-04  2.15E-04  1.86E-04  1.10E-04  1.21E-04  1.07E-04  9.65E-05

The analysis considered a sample of the largest networks with the largest number of clusters. The networks that could provide the most diverse initial populations among those tested in this paper are those with 50000 vertices and small-sized clusters. Therefore, ten independent trials of GA-LP were executed considering these networks picked at random, one for each mixture parameter. Table A.2 shows the average fitness values within each population (rows P1 to P10), their mean over the ten populations (row ‘Average’) and the standard deviation of these averages (row ‘Std. Dev.’).

On the basis of the results presented in Table A.2, for every network, the significantly low standard deviation of the average modularity in the different populations indicates that GA-LP is very stable. It is worth mentioning that individuals inside each population were diverse even though their fitness values deviated very little from the average, resulting in a standard deviation very similar to that observed among populations. Tests with smaller networks presented approximately the same variability, experimentally demonstrating that GA-LP is very robust.
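The summary rows of Table A.2 can be recomputed from the per-population values. The sketch below does so for the µt = 0.1 column only; for this column the reported 'Std. Dev.' appears to correspond to the sample standard deviation (n − 1 in the denominator), which is what `statistics.stdev` computes. The variable names are illustrative.

```python
from statistics import mean, stdev

# Average modularity of populations P1 to P10 for µt = 0.1 (Table A.2).
column_mu_01 = [
    0.157149, 0.156171, 0.156900, 0.156571, 0.156877,
    0.156470, 0.157045, 0.156960, 0.156791, 0.157143,
]

average = mean(column_mu_01)      # row 'Average'
deviation = stdev(column_mu_01)   # row 'Std. Dev.' (sample standard deviation)

print(f"{average:.6f}  {deviation:.2E}")
```

The same two calls applied to the other columns give the remaining entries of the two summary rows.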
