Integrating overlapping community discovery and role analysis: Bayesian probabilistic generative modeling and mean-field variational inference


Engineering Applications of Artificial Intelligence 89 (2020) 103437


Engineering Applications of Artificial Intelligence journal homepage: www.elsevier.com/locate/engappai

Integrating overlapping community discovery and role analysis: Bayesian probabilistic generative modeling and mean-field variational inference✩
Gianni Costa, Riccardo Ortale∗
ICAR-CNR, Via P. Bucci 8/9C, 87036 Rende (CS), Italy

ARTICLE INFO

Keywords: Overlapping community discovery; Role analysis; Link explanation and prediction; Generative probabilistic modeling; Bayesian network analysis

ABSTRACT The joint modeling of community discovery and role analysis has been shown to be useful to explain, predict and reason on network topology. Nonetheless, earlier research on the integration of both tasks suffers from major limitations. Foremost, a key aspect of role analysis, i.e., the strength of role-to-role interactions, is ignored. Moreover, two fundamental properties of networks are disregarded, i.e., heterogeneity in the connectivity structure of communities and the growth of link probability with node involvement in common communities. Additionally, scalability with network size is limited. In this manuscript, we incrementally develop two new machine learning approaches to deal with the aforesaid issues. The proposed approaches consist in performing inference under as many Bayesian generative models of networks with overlapping communities and roles. Under both models, nodes are associated with communities and roles through suitable affiliations, that are dichotomized for link directionality. The strength of such affiliations is captured through nonnegative latent random variables, drawn from Gamma priors. Besides, link establishment is explained by both models through Poisson distributions. In particular, under the second model, the parameterizing rate of the Poisson distribution also accommodates the strength of role-to-role interactions, as captured via latent mixed-membership stochastic blockmodeling. On sparse networks, the adoption of the Poisson distribution expedites model inference. On this point, mean-field variational inference is derived and implemented as a coordinate-ascent algorithm, for the exploratory and unsupervised analysis of node affiliations. Comparative experiments on several real-world networks demonstrate the superiority of the proposed approaches in community discovery, link prediction as well as scalability.

✩ No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.engappai.2019.103437.
∗ Corresponding author. E-mail addresses: [email protected] (G. Costa), [email protected] (R. Ortale).

https://doi.org/10.1016/j.engappai.2019.103437
Received 7 June 2019; Received in revised form 29 October 2019; Accepted 18 December 2019. Available online xxxx. 0952-1976/© 2019 Elsevier Ltd. All rights reserved.

1. Introduction

Complex systems are wholes of interacting entities, e.g., human social systems, the World Wide Web, computer networks, food webs, neural networks and so forth. Earlier reductionism in the study of complex systems led to a primary focus on (the attributes of) their interacting entities taken in isolation. By contrast, the modern network-centric perspective considers the interactions among entities as the mechanisms governing the functioning of complex systems along with (the attributes of) their constituent entities. Accordingly, the properties of complex systems are investigated, understood and explained in terms of interaction patterns among interdependent entities (Kolaczyk, 2009; Wasserman and Faust, 1994). For this purpose, network analysis provides a solid foundation in which to ground the development of theories, methods, techniques, models and algorithms. Three major mainstays of such developments are statistics, probability and graph theory. Graph theory allows for capturing the relational aspects of complex systems by means of compositional structures, in which entities and relationships are represented, respectively, as nodes and links. Statistics along with probability theory allow for coherently reasoning on the observed properties of complex systems in addition to dealing with uncertainty.

Two key tasks in network analysis are community discovery (Fortunato, 2010; Xie et al., 2013; Fortunato and Hric, 2016) and role analysis (McCallum et al., 2007; Scripps et al., 2007a; Ross and Ahmed, 2015). Community discovery unveils the latent organization of networks into functional structures, which are beneficial to the prediction of prospective node ties (Gopalan et al., 2013; Gopalan and Blei, 2013; Xie et al., 2013; Liben-Nowell and Kleinberg, 2007). Role analysis explains node interactions through latent behavioral classes. The seamless integration of community discovery with role analysis was pioneered in Costa and Ortale (2012) and, subsequently, developed in Costa and Ortale (2013, 2014), for the purpose of gaining a deeper insight into network connectivity. The underlying idea is that the two tasks synergistically refine each other. Indeed, role analysis enriches network communities with insightful explanations about the linking behavior of nodes. Symmetrically, community discovery allows for a natural contextualization, that provides implicit hints to understand node roles. Hence, the integration of both tasks into a unified generative model of networks is useful to more realistically explain, forecast and reason on ties between pairs of nodes, in terms of their respective affiliations to communities as well as roles. This is of great relevance in several application domains, e.g., the technological, information, social, ecological, biological, criminological and recommendation ones.

The previous approaches in Costa and Ortale (2012, 2013) rely on Bayesian mixed-membership models, whose generative processes place directed links between nodes based on their respective involvement in communities and roles. The formation of ties is further refined in Costa and Ortale (2014, 2018), where node-specific and contextual latent interaction factors are accommodated. Despite their proven effectiveness and predictive power, the approaches in Costa and Ortale (2012, 2013, 2014) suffer from some major limitations. Firstly, the strength of role-to-role interactions is not taken into account. Secondly, two major features of networks are disregarded, i.e., the heterogeneity of the connectivity structure inside communities (Yang et al., 2014) as well as the growth of link probability with node affiliations to common communities (Yang and Leskovec, 2014). Furthermore, the approaches in Costa and Ortale (2012, 2013, 2014) do not scale with the size of the underlying networks.

In this manuscript, we present two innovative model-based machine-learning approaches (Bishop, 2013) to the exploratory and unsupervised analysis of overlapping communities and behavioral roles in networks. The devised approaches consist in performing posterior inference in as many Bayesian probabilistic generative models of networks, respectively called TORTILLA and QUESADILLA. These express statistical assumptions on the latent factors governing network generation. Essentially, in both approaches, probabilistic graphical modeling (Bishop, 2013; Blei, 2014) is the glue to combine several innovations, with which TORTILLA and QUESADILLA overcome the aforesaid limitations of previous research (Costa and Ortale, 2012, 2013, 2014).

More precisely, TORTILLA (neTwORk Topology from communIty and roLe affiLiAtions) (Costa and Ortale, 2016c) relies on directed affiliation modeling, in order to unveil the heterogeneous connectivity structure of communities. Thus, realistic overlaps and hierarchical nestings (Yang et al., 2014) are naturally enabled between the unveiled communities. In particular, a latent weighted generalization of directed affiliation modeling is devised, with the aim to capture the nonnegative extent of node affiliations to the underlying communities and roles. Accordingly, for each node, the strength of its affiliations to communities does not sum to 1. Likewise, the strength of its affiliations to roles does not sum to 1. This is an appealing feature, that allows for retaining mixed-membership flexibility without unnatural assumptions and unnecessary constraints on the underlying community-overlap structure (Yang and Leskovec, 2014). Under TORTILLA, the strength of node affiliations to both communities and roles is sampled from Gamma distributions, which improves the interpretability of TORTILLA. Moreover, the presence/absence of directed links is sampled from a Poisson distribution. The latter is parameterized by a rate, that captures the extent of interaction between any two nodes, so that the stronger the affiliations of any two nodes to shared communities and corresponding roles, the more likely a directed link connecting them. The choice of the Poisson distribution is also useful to expedite posterior inference on sparse networks (Gopalan et al., 2015), which is of great practical relevance in real-world applications.

QUESADILLA (QUantified nodE interactionS for grAph moDeling based on communIty and roLe affiLiAtions) is a new model, that extends TORTILLA with the aim to more accurately explain network structure also in terms of both role-to-role interactions and role affiliations to communities.

All latent variables under TORTILLA and QUESADILLA are inferred through approximate posterior inference. For this purpose, both models are bundled with respective coordinate-ascent algorithms, implementing mean-field variational inference. The latter ultimately enables the exploratory analysis of network communities and roles together with the prediction of unobserved as well as prospective links. Because of their novel peculiarities, TORTILLA and QUESADILLA are suited to developing new tools for law enforcement, homeland security and intelligence as well as advanced applications for complex system analysis, recommendation and personalization. A comparative experimentation on a selection of several real-world benchmark networks demonstrates the superior performance of TORTILLA and QUESADILLA in community discovery, link prediction as well as scalability. The innovative contributions of this manuscript are summarized below.

• We advance earlier research on jointly modeling community discovery with role analysis through the incorporation of recent findings in modern network analysis. From this viewpoint, TORTILLA and QUESADILLA accommodate a combination of several innovations. These allow the devised models to capture realistic community overlaps (Yang and Leskovec, 2014; Yang et al., 2014) as well as heterogeneity in the connectivity structure of communities (Yang et al., 2014), while improving their scalability on sparse networks (Gopalan et al., 2015).
• A more detailed discussion of TORTILLA is provided with respect to the one originally introduced in Costa and Ortale (2016c). Precisely, all mathematical and algorithmic details are covered on the derivation of mean-field variational inference and its implementation into a variational coordinate-ascent algorithm.
• The empirical assessment of TORTILLA in Costa and Ortale (2016c) is extended with new insightful tests.
• A new model, QUESADILLA, is presented and tested. All experiments involving QUESADILLA are original.
• A comparative experimentation of TORTILLA and QUESADILLA is conducted on a selection of various real-world benchmark networks, with two such networks being very-large-scale, i.e., Twitter and Google+. In particular, the choice of Google+ enriches the original selection of benchmark networks in Costa and Ortale (2016c). Accordingly, all tests of the devised models and their competitors over Google+ are novel.
• An in-depth investigation of the main properties of the selected benchmark networks is performed to supplement the empirical evaluation in Costa and Ortale (2016c).
• The qualitative evaluation is new.
• An overview of related works is presented.

The outline of the present manuscript is as follows. Section 2 situates our research effort in the literature through a review of the most closely related works. Section 3 introduces the adopted notation along with the preliminary concepts. Section 4 proposes TORTILLA. Section 5 presents the mathematical derivation of variational inference under TORTILLA along with a coordinate-ascent algorithm, that implements the resulting variational updates. Section 6 is devoted to the estimates of the latent variables under TORTILLA for network exploration and link prediction. Section 7 discusses QUESADILLA as an incremental enhancement of TORTILLA. Section 8 presents the real-world networks chosen for experimental purposes, a study of some major statistics of such networks and an extensive empirical evaluation of TORTILLA and QUESADILLA against state-of-the-art competitors. Finally, Section 9 concludes and previews future research.

2. Related works

Research on community discovery and role assignment can be categorized into non-probabilistic as well as probabilistic approaches.

Among the non-probabilistic approaches, a variety of link-based methods has been proposed for both tasks. According to Newman (2004a), the link-based methods aimed at community discovery are classified into four groups, respectively relying on graph partitioning (Kernighan and Lin, 1970; Pothen et al., 1990), similarity measures from social sciences (Lorrain and White, 1971; Wasserman and Faust, 1994), edge removal (Girvan and Newman, 2002; Radicchi et al., 2004) and modularity maximization (Newman, 2004b; Newman and Girvan, 2004). The common assumption behind all four groups of approaches is that the density of links inside communities is larger than the density of links between communities. Link partitioning is another link-based method for community discovery, that defines communities as link partitions (Ahn et al., 2010; Evans, 2010; Evans and Lambiotte, 2009, 2010; Kim and Jeong, 2011; Wu, 2010).

The link-based methods conceived for role assignment account for node centrality (Wasserman and Faust, 1994) as well as community-based roles (Chou and Suzuki, 2010; Scripps et al., 2007a,b). Basically, the exploitation of node centrality consists in ranking nodes through some specific measure (e.g., betweenness, closeness and degree), with the aim to establish their importance in the context of a network (Wasserman and Faust, 1994). However, such a node ranking disregards communities (Chou and Suzuki, 2010; Scripps et al., 2007b). The two approaches proposed in Scripps et al. (2007a,b) choose one of four roles (i.e., loner, big fish, bridge and ambassador) for a given node, depending on its degree as well as the score of a metric that estimates in how many communities (i.e., cliques) that node participates. The approach presented in Chou and Suzuki (2010) chooses one of three roles (i.e., hubs, gateways and bridges) for those nodes, that bridge communities (i.e., densely connected groups of nodes), without knowledge about community structure.

A well-known drawback of non-probabilistic approaches for community discovery as well as role assignment, that also affects link-based methods, is that the individual nodes are generally involved in one community and play one role. Link partitioning is an exception, that naturally captures community overlap (Ahn et al., 2010; Evans and Lambiotte, 2009, 2010). Nonetheless, there is no guarantee that it leads to a higher-quality detection of communities in comparison with node partitioning (Xie et al., 2013; Fortunato, 2010).

Probabilistic approaches for community discovery and role assignment enable the affiliation of nodes to more than one community and one role. Several probabilistic approaches have been proposed in the literature, with the aim to uncover network communities. A selection of representative approaches encompasses Henderson et al. (2010), Pathak et al. (2008), Zhou et al. (2006), Zhang et al. (2007), Airoldi et al. (2008), Ahn et al. (2010) and Evans and Lambiotte (2009). The seminal LDA model (Blei et al., 2003) was exploited in Zhang et al. (2007), for the purpose of modeling communities as distributions over network nodes. CART (i.e., community-author-recipient-topic) (Pathak et al., 2008) is a model for the detection of semantic communities, in which nodes interact on topics of mutual interest, which are additionally relevant to the individual communities. Essentially, under CART, the author of an email along with its recipient(s) are modeled as being affiliated to the same community, in regard to the topics of the email. CUT1 along with CUT2 are community-user-topic models, also conceived for semantic community detection (Zhou et al., 2006). Both models are again meant for the communication networks, that arise from the exchanges of emails among correspondents. Notwithstanding, unlike CART, CUT1 as well as CUT2 extract communities, respectively, from the topology of the underlying network and the content of the exchanged messages (Pathak et al., 2008). HCDF (Henderson et al., 2010) is a Bayesian framework for hybrid community discovery, which incorporates communities found through other approaches as hints, in order to improve the effectiveness of results along with their consistency across domains. MMB (Airoldi et al., 2008) extends the latent stochastic blockmodel (Arabie et al., 1978) through the accommodation of mixed membership. Under MMB, the directed interactions between nodes are sampled from an underlying Bernoulli distribution, that is parameterized by the community memberships of the involved nodes. A common limitation of the methods in Henderson et al. (2010), Pathak et al. (2008), Zhou et al. (2006) and Zhang et al. (2007) is that, differently from the proposed approaches, the roles of nodes are not considered explicitly.

Topic models (Blei and Lafferty, 2009; Steyvers and Griffiths, 2007) have also been used to unveil social network structure and, jointly, perform role analysis from both node interactions and their contents. Specifically, ART (author-recipient-topic) and RART (role-author-recipient-topic) are developed in McCallum et al. (2007) for role analysis in networks of email exchanges among correspondents. However, communities are explicitly uncovered neither by ART nor by RART. rrPLSA (Zhao et al., 2015) is a regularized topic model, that incorporates key aspects of role theory (Biddle, 1986). Roles have also been considered in some approaches (Xu et al., 2012; Ma et al., 2015) for community question answering (Srba and Bielikova, 2016; Wang et al., 2018; Yuan et al., 2019). Actually, the task of community discovery is not among the aims of Zhao et al. (2015), Xu et al. (2012) and Ma et al. (2015).

The synergic modeling of community discovery and role analysis is accomplished in Costa and Ortale (2012, 2013, 2014, 2016a). Nonetheless, such models differ from TORTILLA and QUESADILLA in several respects. As in the case of TORTILLA, even in Costa and Ortale (2012, 2013, 2014), the contribution of node roles to link establishment is not thoroughly captured. Also, heterogeneity in the connectivity structure of communities is not allowed under Costa and Ortale (2012, 2013, 2014). Moreover, posterior inference under Costa and Ortale (2012, 2013, 2014) relies on Gibbs sampling (Bishop, 2006), which penalizes scalability with the size of the underlying networks. TORTILLA and QUESADILLA differ from Costa and Ortale (2016a) in the explicit probabilistic graphical modeling of roles as external, though affiliated to communities.

3. Preliminaries

The adopted notation along with some fundamental concepts as well as the problem statement are introduced next.

3.1. Network representation

The generic network is formalized as a directed graph $\mathcal{G} = \{\boldsymbol{N}_{\mathcal{G}}, \boldsymbol{E}_{\mathcal{G}}\}$. $\boldsymbol{N}_{\mathcal{G}} = \{1, \ldots, N\}$ is a set of nodes, that are numbered 1 through $N$. $\boldsymbol{E}_{\mathcal{G}} \subseteq \boldsymbol{N}_{\mathcal{G}} \times \boldsymbol{N}_{\mathcal{G}}$ is a set of directed links (or, also, ordered node pairs). Nodes are abstractions of the entities involved in the network (e.g., individuals, computers, organizations, web pages, proteins and so forth). Links represent asymmetric node interactions and are summarized by an $N \times N$ binary adjacency matrix $\boldsymbol{L}$. Assume that $u \to v$ is some interaction from node $u$ to node $v$, that may result in a link from $u$ to $v$. The corresponding entry of $\boldsymbol{L}$, namely $L_{u \to v}$, is 1 iff the interaction $u \to v$ actually originates a link of $\mathcal{G}$ (i.e., $\langle u, v \rangle \in \boldsymbol{E}_{\mathcal{G}}$) and 0 otherwise.

3.2. Latent network characterization

Two latent features of a network $\mathcal{G}$ are its structural arrangement $\boldsymbol{C}$ and the variety $\boldsymbol{R}$ of behavioral roles. The latent arrangement $\boldsymbol{C} = \{C_1, \ldots, C_K\}$ corresponds to the unobserved structural organization of the nodes of $\mathcal{G}$ into $K$ hidden communities. Within such communities, nodes interact according to connectivity patterns, that are ascribable to unknown behavioral roles. Formally, these are collectively represented as the set $\boldsymbol{R} = \{R_1, \ldots, R_H\}$ of $H$ hidden roles. Each node $u \in \boldsymbol{N}_{\mathcal{G}}$ is involved in all of the individual communities, although with a different degree of participation. In addition, within each community, all roles can be played by $u$, although to a distinct extent. With the aim to accurately model node affiliations to both communities and roles, we

follow Yang et al. (2014) and, accordingly, distinguish between sender as well as receiver affiliations. Basically, a sender affiliation of $u$ to the generic community $C_k$ holds, whenever $u$ establishes a link to another node of $C_k$. Dually, a receiver affiliation of $u$ to $C_k$ holds, whenever another node of $C_k$ establishes a link to $u$. Obviously, the affiliation of $u$ to $C_k$ can even hold both as a sender and as a receiver. The degree to which $u$ is affiliated to $C_k$ as a sender is the nonnegative and unknown strength $\vartheta^{(s)}_{u,k}$. $\vartheta^{(s)}_{u,k} = 0$ iff $u$ links to no node inside $C_k$. The degree to which $u$ is affiliated to $C_k$ as a receiver is the nonnegative and unknown strength $\vartheta^{(r)}_{u,k}$. $\vartheta^{(r)}_{u,k} = 0$ iff no node inside $C_k$ links to $u$. Besides, a sender affiliation of $u$ to the arbitrary role $R_h$ means that role $R_h$ is played by $u$ when targeting other nodes. Dually, a receiver affiliation of $u$ to $R_h$ means that role $R_h$ is played by $u$ when being targeted by other nodes. The degree to which $u$ is affiliated to $R_h$ as a sender is the nonnegative strength $\varphi^{(s)}_{u,h}$. $\varphi^{(s)}_{u,h} = 0$ iff $u$ does not play role $R_h$ when pointing to other nodes. The degree to which $u$ is affiliated to $R_h$ as a receiver is the nonnegative strength $\varphi^{(r)}_{u,h}$. $\varphi^{(r)}_{u,h} = 0$ iff $u$ does not play role $R_h$ when being pointed to by other nodes. Notice that the dichotomization of the affiliations of the individual nodes to the latent network roles is an innovative contribution of the present manuscript. Such a contribution supplements the conventional dichotomization of the affiliations of nodes to the latent network communities in Yang et al. (2014). Collectively, affiliation strengths are succinctly indicated as $\Theta \triangleq \{\vartheta^{(s)}_{u,k}, \vartheta^{(r)}_{u,k} \mid u \in \boldsymbol{N}_{\mathcal{G}}, C_k \in \boldsymbol{C}\}$ and $\Phi \triangleq \{\varphi^{(s)}_{u,h}, \varphi^{(r)}_{u,h} \mid u \in \boldsymbol{N}_{\mathcal{G}}, R_h \in \boldsymbol{R}\}$, respectively.

In addition, a suitable notation is exploited to denote node affiliations across interactions. Specifically, $C_{u \to v} \in \boldsymbol{C}$ and $R_{u \to v} \in \boldsymbol{R}$ indicate, respectively, the membership community and played role of $u$, when $u$ targets $v$. Dually, $C_{v \leftarrow u} \in \boldsymbol{C}$ and $R_{v \leftarrow u} \in \boldsymbol{R}$ indicate, respectively, the membership community and played role of $v$, when $v$ is targeted by $u$.

3.3. Latent weighted generalized affiliation modeling

Affiliation strengths $\Theta$ and $\Phi$ are used in Section 4 to explain network topology. Hereunder, we highlight their connection with directed affiliation modeling (Yang et al., 2014). Formally, assume that $\mathcal{G}$ is a generic network. Also, let $\boldsymbol{C}$ and $\boldsymbol{R}$ be, respectively, the latent communities and roles of $\mathcal{G}$. The notion of directed affiliation model denotes a bipartite graph $B(\boldsymbol{N}_{\mathcal{G}} \cup \boldsymbol{C}, \boldsymbol{A}^{(s)} \cup \boldsymbol{A}^{(r)})$, where

• $\boldsymbol{N}_{\mathcal{G}} \cup \boldsymbol{C}$ represents a set of heterogeneous nodes;
• $\boldsymbol{A}^{(s)} \subseteq \boldsymbol{N}_{\mathcal{G}} \times \boldsymbol{C}$ is the set of node affiliations to communities as senders;
• $\boldsymbol{A}^{(r)} \subseteq \boldsymbol{N}_{\mathcal{G}} \times \boldsymbol{C}$ is the set of node affiliations to communities as receivers.

Directed affiliation modeling is of great relevance in network analysis for two well-known reasons (Yang et al., 2014). Foremost, $B(\boldsymbol{N}_{\mathcal{G}} \cup \boldsymbol{C}, \boldsymbol{A}^{(s)} \cup \boldsymbol{A}^{(r)})$ allows for a diversified connectivity structure of the communities in $\boldsymbol{C}$, that are accordingly categorized into cohesive as well as 2-mode. Additionally, realistic overlaps/hierarchical nestings are enabled by $B(\boldsymbol{N}_{\mathcal{G}} \cup \boldsymbol{C}, \boldsymbol{A}^{(s)} \cup \boldsymbol{A}^{(r)})$ between the communities in $\boldsymbol{C}$.

Interestingly, under TORTILLA, $\Theta$ and $\Phi$ specify a latent weighted generalization of conventional directed affiliation modeling. Indeed, inferring $\Theta$ and $\Phi$ implicitly defines a heterogeneous affiliation graph $B(\boldsymbol{V}_B, \boldsymbol{E}_B)$. Here, $\boldsymbol{V}_B \triangleq \boldsymbol{N}_{\mathcal{G}} \cup \boldsymbol{C} \cup \boldsymbol{R}$ denotes a heterogeneous set of nodes. $\boldsymbol{E}_B \triangleq \boldsymbol{A}^{(s)} \cup \boldsymbol{A}^{(r)} \cup \boldsymbol{T}^{(s)} \cup \boldsymbol{T}^{(r)}$ represents a heterogeneous set of arcs. Basically, in $B(\boldsymbol{V}_B, \boldsymbol{E}_B)$, four different types of arcs (i.e., $\boldsymbol{A}^{(s)}$, $\boldsymbol{A}^{(r)}$, $\boldsymbol{T}^{(s)}$ and $\boldsymbol{T}^{(r)}$) relate three distinct types of nodes (i.e., $\boldsymbol{N}_{\mathcal{G}}$, $\boldsymbol{C}$ and $\boldsymbol{R}$). In particular, $\boldsymbol{A}^{(s)} \subseteq \boldsymbol{N}_{\mathcal{G}} \times \boldsymbol{C} \times \mathbb{R}^{+}$ is the set of the weighted sender affiliations of the nodes in $\boldsymbol{N}_{\mathcal{G}}$ to the latent communities in $\boldsymbol{C}$. More in detail, $\boldsymbol{A}^{(s)} \triangleq \{\langle u, C_k, \vartheta^{(s)}_{u,k} \rangle \mid \vartheta^{(s)}_{u,k} \in \Theta\}$, where the triple $\langle u, C_k, \vartheta^{(s)}_{u,k} \rangle$ is a directed arc from $u$ to $C_k$ weighted by the sender affiliation strength $\vartheta^{(s)}_{u,k}$. Analogously, $\boldsymbol{A}^{(r)} \subseteq \boldsymbol{C} \times \boldsymbol{N}_{\mathcal{G}} \times \mathbb{R}^{+}$ is the set of the weighted receiver affiliations of the nodes in $\boldsymbol{N}_{\mathcal{G}}$ to the latent communities in $\boldsymbol{C}$. In greater detail, $\boldsymbol{A}^{(r)} \triangleq \{\langle C_k, u, \vartheta^{(r)}_{u,k} \rangle \mid \vartheta^{(r)}_{u,k} \in \Theta\}$, where the triple $\langle C_k, u, \vartheta^{(r)}_{u,k} \rangle$ is a directed arc from $C_k$ to $u$ weighted by the receiver affiliation strength $\vartheta^{(r)}_{u,k}$. Furthermore, $\boldsymbol{T}^{(s)} \subseteq \boldsymbol{N}_{\mathcal{G}} \times \boldsymbol{R} \times \mathbb{R}^{+}$ is the set of the weighted sender affiliations of the nodes in $\boldsymbol{N}_{\mathcal{G}}$ to the roles in $\boldsymbol{R}$. More specifically, $\boldsymbol{T}^{(s)} \triangleq \{\langle u, R_h, \varphi^{(s)}_{u,h} \rangle \mid \varphi^{(s)}_{u,h} \in \Phi\}$, where the triple $\langle u, R_h, \varphi^{(s)}_{u,h} \rangle$ is a directed arc from $u$ to $R_h$ weighted by the sender affiliation strength $\varphi^{(s)}_{u,h}$. Similarly, $\boldsymbol{T}^{(r)} \subseteq \boldsymbol{R} \times \boldsymbol{N}_{\mathcal{G}} \times \mathbb{R}^{+}$ is the set of the weighted receiver affiliations of the nodes in $\boldsymbol{N}_{\mathcal{G}}$ to the roles in $\boldsymbol{R}$. In further detail, $\boldsymbol{T}^{(r)} \triangleq \{\langle R_h, u, \varphi^{(r)}_{u,h} \rangle \mid \varphi^{(r)}_{u,h} \in \Phi\}$, where the triple $\langle R_h, u, \varphi^{(r)}_{u,h} \rangle$ is a directed arc from $R_h$ to $u$ weighted by the receiver affiliation strength $\varphi^{(r)}_{u,h}$.

3.4. Problem statement

Let $\mathcal{G}$ be an observed network. Also, assume that $K$ and $H$ are, respectively, the number of latent communities and roles to find in $\mathcal{G}$. Our aim is to perform

• the unsupervised and exploratory analysis of $\mathcal{G}$, namely inferring $\Theta$ as well as $\Phi$;
• the prediction of prospective links between pairs of unconnected nodes in $\boldsymbol{N}_{\mathcal{G}}$.

For the accomplishment of both tasks, in Section 4, we present a model-based approach, that consists in inferring a posterior distribution over $\Theta$ as well as $\Phi$, given the adjacency matrix $\boldsymbol{L}$ of the input network $\mathcal{G}$. The aforesaid posterior distribution is approximated by means of variational inference under a generative latent-variable model of networks, that explains the formation of $\mathcal{G}$ from a Bayesian probabilistic viewpoint. The approach of Section 4 is incrementally refined in Section 7. Both approaches share some common assumptions. More precisely, the topology (or, equivalently, adjacency matrix) of the input network $\mathcal{G}$ is entirely known. Instead, all affiliation strengths of $\Theta$ as well as $\Phi$ are unknown and not directly measurable. For this reason, the individual elements of $\Theta$ as well as $\Phi$ are considered as suitable random variables. Moreover, for any two nodes $u, v \in \boldsymbol{N}_{\mathcal{G}}$, $C_{u \to v}$, $R_{u \to v}$, $C_{u \leftarrow v}$ and $R_{u \leftarrow v}$ are also regarded as specific random variables.

4. Community- and role-affiliation modeling under TORTILLA

TORTILLA (neTwORk Topology from communIty and roLe affiLiAtions) is a generative network model, in which tie formation is the outcome of a Bayesian probabilistic process. The latter establishes the degree to which nodes are affiliated to communities as well as roles and, then, places directed edges between nodes. Such a generative process meets three key requirements regarding network connectivity, that are specified next.

1. Nodes are affiliated to multiple communities. For each node, the cumulative strength of its affiliations to the various communities does not amount to 1. Accordingly, a strong affiliation of the generic node to one or more communities does not diminish the strength of affiliation to the remaining membership communities.
2. Nodes are affiliated to multiple roles. For each node, the cumulative strength of its affiliations to the various roles does not amount to 1. Accordingly, a strong affiliation of the generic node to one or more roles does not diminish the strength of affiliation to the other roles.
3. The probability that two nodes are connected by the generative process through a directed link grows with their degree of affiliation to shared communities as well as respective roles.

The mathematical details of TORTILLA are provided in Section 4.1. Its generative semantics is covered in Section 4.2, where it is also explained how TORTILLA meets the aforesaid requirements.
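As a concrete reading of the notation of Section 3, the following minimal Python sketch builds the binary adjacency matrix from a list of directed links and materializes the four weighted arc sets of the heterogeneous affiliation graph from given affiliation strengths. It is an illustration under stated assumptions and not part of TORTILLA itself: the function names `adjacency_matrix` and `affiliation_arcs`, the 0-based node numbering, the `("C", k)` and `("R", h)` labels, the array layout and the choice of omitting zero-strength arcs are all hypothetical.

```python
import numpy as np


def adjacency_matrix(num_nodes, links):
    """Return the N x N binary matrix L, with L[u, v] = 1 iff <u, v> is a link."""
    L = np.zeros((num_nodes, num_nodes), dtype=np.int8)
    for u, v in links:
        L[u, v] = 1  # directed link: L[v, u] is left untouched
    return L


def affiliation_arcs(theta_s, theta_r, phi_s, phi_r):
    """Return the weighted arc sets A_s, A_r, T_s, T_r as lists of triples.

    theta_s, theta_r: (N, K) arrays of sender/receiver community strengths.
    phi_s, phi_r:     (N, H) arrays of sender/receiver role strengths.
    Zero-strength affiliations are omitted (a choice of this sketch).
    """
    def arcs(strengths, label, node_first):
        out = []
        for u, j in zip(*np.nonzero(strengths)):
            target = (label, int(j))
            w = float(strengths[u, j])
            # Sender arcs go from the node to the target, receiver arcs the other way.
            out.append((int(u), target, w) if node_first else (target, int(u), w))
        return out

    A_s = arcs(theta_s, "C", True)    # node -> community
    A_r = arcs(theta_r, "C", False)   # community -> node
    T_s = arcs(phi_s, "R", True)      # node -> role
    T_r = arcs(phi_r, "R", False)     # role -> node
    return A_s, A_r, T_s, T_r
```

Storing the strengths as dense arrays keeps the sketch short; for large sparse networks a sparse representation would be the more natural choice.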

4.1. Observed-data likelihood and prior distributions

Let $\mathcal{G}$ be an observed network with a latent organization $\boldsymbol{C}$ of its nodes into $K$ unknown communities and an underlying set $\boldsymbol{R}$ of $H$ behavioral roles. TORTILLA postulates Bayesian probabilistic assumptions, that explain how the unknown affiliation strengths $\Theta$ and $\Phi$ govern the formation of $\mathcal{G}$. Formally, the data likelihood, i.e., the distribution over the observed links $\boldsymbol{L}$ conditioned on $\Theta$ as well as $\Phi$, is

$$\Pr(\boldsymbol{L} \mid \Theta, \Phi) = \prod_{u,v \in \boldsymbol{N}_{\mathcal{G}}} \Pr(L_{u \to v} \mid \boldsymbol{\vartheta}^{(s)}_{u,\cdot}, \boldsymbol{\vartheta}^{(r)}_{v,\cdot}, \boldsymbol{\varphi}^{(s)}_{u,\cdot}, \boldsymbol{\varphi}^{(r)}_{v,\cdot})$$

where $\Pr(L_{u \to v} \mid \boldsymbol{\vartheta}^{(s)}_{u,\cdot}, \boldsymbol{\vartheta}^{(r)}_{v,\cdot}, \boldsymbol{\varphi}^{(s)}_{u,\cdot}, \boldsymbol{\varphi}^{(r)}_{v,\cdot})$ is a generalization of the probabilistic link establishment in Yang and Leskovec (2014) and Yang et al. (2014), that also benefits from Poisson modeling (Gopalan et al., 2015). Specifically, $\Pr(L_{u \to v} \mid \boldsymbol{\vartheta}^{(s)}_{u,\cdot}, \boldsymbol{\vartheta}^{(r)}_{v,\cdot}, \boldsymbol{\varphi}^{(s)}_{u,\cdot}, \boldsymbol{\varphi}^{(r)}_{v,\cdot})$ conceives $L_{u \to v}$ as a Poisson-distributed random variable, whose governing parameter is the cumulative extent of interaction from $u$ to $v$ across their common community affiliations and respective roles. Accordingly, $\Pr(L_{u \to v} \mid \boldsymbol{\vartheta}^{(s)}_{u,\cdot}, \boldsymbol{\vartheta}^{(r)}_{v,\cdot}, \boldsymbol{\varphi}^{(s)}_{u,\cdot}, \boldsymbol{\varphi}^{(r)}_{v,\cdot}) \triangleq \mathrm{Poisson}(L_{u \to v} \mid \lambda_{u,v})$ with $\lambda_{u,v}$ being the below parameterizing rate

$$\lambda_{u,v} = \sum_{k=1}^{K} \sum_{h=1,h'=1}^{H} \vartheta^{(s)}_{u,k} \, \varphi^{(s)}_{u,h} \, \varphi^{(r)}_{v,h'} \, \vartheta^{(r)}_{v,k} \qquad (1)$$

Due to Eq. (1), under TORTILLA, the probability of a directed link from $u$ to $v$ increases with the affiliations of both nodes to common communities as well as the strength of their affiliations to both the common communities and the respective roles. The growing link probability with shared community affiliations empirically observed in Yang and Leskovec (2014) is, thus, generalized to also account for link directionality as well as affiliation strength.

The below Gamma priors are placed over the generic affiliation strengths $\vartheta^{(\cdot)}_{u,k}$ and $\varphi^{(\cdot)}_{u,h}$

$$\Pr(\vartheta^{(\cdot)}_{u,k} \mid \alpha, \beta) \triangleq \mathrm{Gamma}(\vartheta^{(\cdot)}_{u,k} \mid \alpha, \beta)$$

$$\Pr(\varphi^{(\cdot)}_{u,h} \mid \delta, \epsilon) \triangleq \mathrm{Gamma}(\varphi^{(\cdot)}_{u,h} \mid \delta, \epsilon)$$

where $\alpha$, $\beta$, $\delta$ and $\epsilon$ are hyperparameters of TORTILLA. Specifically, $\alpha$ is the shape of the Gamma prior distribution on $\vartheta^{(\cdot)}_{u,k}$ and $\beta$ the respective rate. Analogously, $\delta$ is the shape of the Gamma prior distribution on $\varphi^{(\cdot)}_{u,h}$ and $\epsilon$ the respective rate. Fig. 1 shows (through plate notation) a directed graphical representation of the conditional (in)dependencies of the random variables under TORTILLA.

Fig. 1. Graphical representation of TORTILLA.

[…] individual node can be affiliated to all communities and roles. Additionally, the strength of such affiliations to the membership communities as well as the played roles does not add up to 1. At step (B) of Fig. 2, the affiliation strengths $\Theta$ and $\Phi$ are leveraged to rule link establishment. Interestingly, the Poisson distribution on the observed links $\boldsymbol{L}$ is useful for faster posterior inference from sparse networks (Gopalan et al., 2015), that are typically encountered in network analysis. Moreover, the generic Poisson rate $\lambda_{u,v}$ in Eq. (1) grows with the strength of the affiliations of $u$ and $v$ to (common) communities and respective roles (since $u$ and $v$ more strongly interact through their respective roles across multiple common communities). This satisfies the requirement (3) of Section 4. With the generative and mathematical details in place, we proceed to discuss posterior inference under TORTILLA in Section 5.

5. Variational inference TORTILLA is a statistical model that describes the generation of an observed network  given the latent variables Θ and Φ (i.e., the strength of the affiliations of the nodes in  to the latent communities of 𝑪 and underlying roles of 𝑹). Posterior inference amounts to invert the generative process postulated by TORTILLA, to compute the posterior distribution Pr(Θ, Φ|𝑳) over Θ and Φ given the adjacency matrix 𝑳 of the observed network . Unfortunately, as it generally happens with probabilistic models of practical interest in modern Bayesian statistics, exact inference is intractable because of the complexity of the posterior distribution. Thus, we resort to mean-field variational inference. The latter allows for the analytical approximation of the true posterior and, also, tends to be faster as well as more-easily scalable to large networks with respect to MCMC sampling (Blei et al., 2017). In order to simplify the derivation and implementation of meanfield variational inference, we follow the approach in Gopalan et al. (2015) and add further auxiliary latent variables to the original formulation of the TORTILLA model. More precisely, due to the additive property of Poisson random variables, the generic 𝐿𝑢→𝑣 can be rewrit′) ∑ ∑𝐻 (𝑘,ℎ,ℎ′ ) ten as 𝐿𝑢→𝑣 = 𝐾 , where 𝑧(𝑘,ℎ,ℎ ∼ 𝑃 𝑜𝑖𝑠𝑠𝑜𝑛(𝜗(𝑠) 𝜑(𝑠) 𝑢,𝑣 𝑘=1 ℎ,ℎ′ =1 𝑧𝑢,𝑣 𝑢,𝑘 𝑢,ℎ

4.2. Generative semantics The generative process under TORTILLA is the imaginary act of assigning values to the random variables of Fig. 1, on the basis of their conditional (in)dependencies. Essentially, the latent random variables are drawn from their respective Gamma prior distributions, while the observed random variables are realized from the corresponding Poisson distributions. Because of the definition of the Poisson rates at Eq. (1), in such a generative process, the latent random variables explain the observed ones and, accordingly, the generation of the whole network topology. A detailed description of the individual steps of the generative process under TORTILLA is in reported Fig. 2 and commented below. At step (A) of Fig. 2, TORTILLA draws the strengths Θ and Φ of node affiliations to communities and roles, respectively, thus implicitly defining a weighted generalization of conventional affiliation modeling as discussed in Section 3.3. Remarkably, due to the Gamma priors, the strengths of node affiliations are nonnegative. This encourages sparseness in the representation of TORTILLA and, consequently, enhances its interpretation. Moreover, it ensures that TORTILLA actually meets the requirements (1) and (2) of Section 4. Indeed, under TORTILLA, each



) 𝜑(𝑟) 𝜗(𝑟) ). Here, the generic auxiliary random variable 𝑧(𝑘,ℎ,ℎ is the 𝑢,𝑣 𝑣,ℎ′ 𝑣,𝑘 contribution to link 𝐿𝑢→𝑣 from the affiliations of 𝑢 and 𝑣 to the common community 𝐶𝑘 and respective roles 𝑅ℎ and 𝑅ℎ′ . Notably, the auxiliary ′) random vector 𝒛𝑢,𝑣 = {𝑧(𝑘,ℎ,ℎ |𝑘 = 1, … , 𝐾 and ℎ, ℎ′ = 1, … , 𝐻} 𝑢,𝑣 preserves the marginal distribution of 𝐿𝑢→𝑣 .


Fig. 2. The generative process under TORTILLA.
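The two steps of the generative process in Fig. 2 can be imitated with a few lines of NumPy. The sketch below is a minimal illustration under stated assumptions: the function name `sample_tortilla_network`, the hyperparameter values and the removal of self-loops are choices made here for the example and are not taken from the original text. Note that NumPy parameterizes the Gamma distribution by shape and scale, so the rate is converted through scale = 1 / rate.

```python
import numpy as np


def sample_tortilla_network(n_nodes, K, H, alpha, beta, delta, eps, seed=0):
    """Draw one synthetic directed network following steps (A) and (B)."""
    rng = np.random.default_rng(seed)

    # Step (A): Gamma(shape, rate) draws for all affiliation strengths.
    theta_s = rng.gamma(alpha, 1.0 / beta, size=(n_nodes, K))  # sender community strengths
    theta_r = rng.gamma(alpha, 1.0 / beta, size=(n_nodes, K))  # receiver community strengths
    phi_s = rng.gamma(delta, 1.0 / eps, size=(n_nodes, H))     # sender role strengths
    phi_r = rng.gamma(delta, 1.0 / eps, size=(n_nodes, H))     # receiver role strengths

    # Step (B): Poisson link variables whose rates follow Eq. (1).
    rates = np.einsum("uk,uh,vg,vk->uv", theta_s, phi_s, phi_r, theta_r)
    links = rng.poisson(rates)
    np.fill_diagonal(links, 0)  # assumption of this sketch: self-loops are discarded
    return links, rates


# Hypothetical hyperparameter values, used only to obtain a small sparse example.
links, rates = sample_tortilla_network(n_nodes=50, K=4, H=2,
                                       alpha=0.3, beta=1.0, delta=0.3, eps=1.0)
print("pairs with at least one link:", int((links > 0).sum()))
```

Small shape values concentrate the Gamma draws near zero, which is one way to see how the priors favor sparse affiliations and, in turn, sparse networks.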

Fig. 3. Graphical representation of QUESADILLA.

Let $\boldsymbol{Z} = \{\boldsymbol{z}_{u,v} \mid u, v \in \boldsymbol{V}\}$ be the set of all auxiliary variables added to the TORTILLA model. The mean-field family over the latent variables in $\Theta$, $\Phi$ and $\boldsymbol{Z}$ has the below factorized form

\[ q(\Theta, \Phi, \boldsymbol{Z} \mid \boldsymbol{\mu}) = \prod_{u \in \boldsymbol{N}, C_k \in \boldsymbol{C}} q(\vartheta^{(s)}_{u,k} \mid \pi_{u,k}) \cdot \prod_{v \in \boldsymbol{N}, C_k \in \boldsymbol{C}} q(\vartheta^{(r)}_{v,k} \mid \xi_{v,k}) \cdot \prod_{u \in \boldsymbol{N}, R_h \in \boldsymbol{R}} q(\varphi^{(s)}_{u,h} \mid \lambda_{u,h}) \cdot \prod_{v \in \boldsymbol{N}, R_{h'} \in \boldsymbol{R}} q(\varphi^{(r)}_{v,h'} \mid \eta_{v,h'}) \cdot \prod_{u,v \in \boldsymbol{N}} q(\boldsymbol{z}_{u,v} \mid \boldsymbol{\gamma}_{u,v}) \qquad (2) \]

with $\boldsymbol{\mu} \triangleq \{\pi_{u,k}, \xi_{v,k}, \lambda_{u,h}, \eta_{v,h'}, \boldsymbol{\gamma}_{u,v} \mid u, v \in \boldsymbol{N}, C_k \in \boldsymbol{C}, R_h, R_{h'} \in \boldsymbol{R}\}$ being the set of all variational parameters, that individually condition the different factors on the right hand side of Eq. (2).

For the class of conditionally conjugate models, $\boldsymbol{\mu}$ can be fitted to the observed network through a simple coordinate-ascent algorithm (Blei et al., 2017). Conditional conjugacy holds for any statistical model, in which the full conditional distributions over the latent variables (i.e., the probability distributions over the individual latent variables given the observations and all other latent variables of that model) belong to the exponential family. As far as TORTILLA is concerned, the full conditionals over the generic latent variables $\vartheta^{(s)}_{u,k}$, $\vartheta^{(r)}_{v,k}$, $\varphi^{(s)}_{u,h}$ and $\varphi^{(r)}_{v,h'}$ are the Gamma distributions at Eqs. (3) to (6). Moreover, the full conditional over the generic (auxiliary) latent variables $\boldsymbol{z}_{u,v}$ is the multinomial at Eq. (7) with outcome probabilities encoded by parameter $\boldsymbol{\varrho}$ at Eq. (8) (Bishop et al., 2007). With all of the full conditionals being in the exponential family, TORTILLA (with the addition of the auxiliary variables) is conditionally conjugate and, hence, coordinate-ascent can be used to fit the variational parameters $\boldsymbol{\mu}$. Eqs. (3)–(8) are given in Box I.

For this purpose, the form of each factor in the right hand side of Eq. (2) is set to be the corresponding full conditional distribution. Accordingly, the variational parameters $\pi_{u,k}$, $\xi_{v,k}$, $\lambda_{u,h}$ and $\eta_{v,h'}$ are Gamma parameters, each consisting of a shape and a rate, i.e., $\pi_{u,k} = [\pi^{(shp)}_{u,k}, \pi^{(rate)}_{u,k}]$, $\xi_{v,k} = [\xi^{(shp)}_{v,k}, \xi^{(rate)}_{v,k}]$, $\lambda_{u,h} = [\lambda^{(shp)}_{u,h}, \lambda^{(rate)}_{u,h}]$ and $\eta_{v,h'} = [\eta^{(shp)}_{v,h'}, \eta^{(rate)}_{v,h'}]$ (where superscripts (shp) and (rate) denote shape and rate, respectively). Instead, $\boldsymbol{\gamma}_{u,v}$ is a multinomial parameter. These variational parameters are optimized one at a time through mathematical updates, obtained as described in Blei et al. (2017) and reported at Eqs. (9) to (13). In particular, the shape parameters in the updates at Eqs. (9) to (12) are calculated by using $E[z^{(k,h,h')}_{u,v}] = L_{u \to v}\, \gamma^{(k,h,h')}_{u,v}$. Also, the rate parameters in the same updates are computed by applying the definition of the mean of a Gamma distribution, i.e., as the shape-to-rate ratio. In addition, $\Psi[\cdot]$ at Eq. (13) is the digamma function (i.e., the first derivative of the $\log \Gamma$ function). Eqs. (9)–(13) are given in Box II.

The pseudo-code of the variational coordinate-ascent algorithm is sketched in Algorithm 1. Essentially, the latter iteratively optimizes each variational parameter (while the others are kept fixed) until convergence to a local optimum (Bishop, 2006). More precisely, after a preliminary initialization stage (line 1), the algorithm enters a loop (lines 2–28) to update the individual variational parameters. Such a loop halts upon convergence, which is detected (at line 28) when the difference in the average predictive log likelihood of a validation set


Fig. 4. The generative process under QUESADILLA.

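\[ \vartheta^{(s)}_{u,k} \mid \alpha, \beta, \boldsymbol{Z}, \Theta_{\neg \vartheta^{(s)}_{u,k}}, \Phi \sim \mathrm{Gamma}\left[ \left( \sum_{v \in \boldsymbol{N}, R_h, R_{h'} \in \boldsymbol{R}} z^{(k,h,h')}_{u \to v} + \alpha \right), \left( \sum_{v \in \boldsymbol{N}, R_h, R_{h'} \in \boldsymbol{R}} \varphi^{(s)}_{u,h}\, \varphi^{(r)}_{v,h'}\, \vartheta^{(r)}_{v,k} + \beta \right) \right] \qquad (3) \]

\[ \vartheta^{(r)}_{v,k} \mid \alpha, \beta, \boldsymbol{Z}, \Theta_{\neg \vartheta^{(r)}_{v,k}}, \Phi \sim \mathrm{Gamma}\left[ \left( \sum_{u \in \boldsymbol{N}, R_h, R_{h'} \in \boldsymbol{R}} z^{(k,h,h')}_{u \to v} + \alpha \right), \left( \sum_{u \in \boldsymbol{N}, R_h, R_{h'} \in \boldsymbol{R}} \vartheta^{(s)}_{u,k}\, \varphi^{(s)}_{u,h}\, \varphi^{(r)}_{v,h'} + \beta \right) \right] \qquad (4) \]

\[ \varphi^{(s)}_{u,h} \mid \delta, \epsilon, \boldsymbol{Z}, \Theta, \Phi_{\neg \varphi^{(s)}_{u,h}} \sim \mathrm{Gamma}\left[ \left( \sum_{v \in \boldsymbol{N}, C_k \in \boldsymbol{C}, R_{h'} \in \boldsymbol{R}} z^{(k,h,h')}_{u \to v} + \delta \right), \left( \sum_{v \in \boldsymbol{N}, C_k \in \boldsymbol{C}, R_{h'} \in \boldsymbol{R}} \vartheta^{(s)}_{u,k}\, \varphi^{(r)}_{v,h'}\, \vartheta^{(r)}_{v,k} + \epsilon \right) \right] \qquad (5) \]

\[ \varphi^{(r)}_{v,h'} \mid \delta, \epsilon, \boldsymbol{Z}, \Theta, \Phi_{\neg \varphi^{(r)}_{v,h'}} \sim \mathrm{Gamma}\left[ \left( \sum_{u \in \boldsymbol{N}, C_k \in \boldsymbol{C}, R_h \in \boldsymbol{R}} z^{(k,h,h')}_{u \to v} + \delta \right), \left( \sum_{u \in \boldsymbol{N}, C_k \in \boldsymbol{C}, R_h \in \boldsymbol{R}} \vartheta^{(s)}_{u,k}\, \varphi^{(s)}_{u,h}\, \vartheta^{(r)}_{v,k} + \epsilon \right) \right] \qquad (6) \]

\[ \boldsymbol{z}_{u,v} \mid L_{u \to v}, \Theta, \Phi \sim \mathrm{Multinomial}(L_{u \to v}, \boldsymbol{\varrho}_{u \to v}) \qquad (7) \]

\[ \boldsymbol{\varrho}_{u \to v} = \left[ \frac{\vartheta^{(s)}_{u,1}\, \varphi^{(s)}_{u,1}\, \varphi^{(r)}_{v,1}\, \vartheta^{(r)}_{v,1}}{\sum_{C_k \in \boldsymbol{C}, R_h, R_{h'} \in \boldsymbol{R}} \vartheta^{(s)}_{u,k}\, \varphi^{(s)}_{u,h}\, \varphi^{(r)}_{v,h'}\, \vartheta^{(r)}_{v,k}}, \ldots, \frac{\vartheta^{(s)}_{u,K}\, \varphi^{(s)}_{u,H}\, \varphi^{(r)}_{v,H}\, \vartheta^{(r)}_{v,K}}{\sum_{C_k \in \boldsymbol{C}, R_h, R_{h'} \in \boldsymbol{R}} \vartheta^{(s)}_{u,k}\, \varphi^{(s)}_{u,h}\, \varphi^{(r)}_{v,h'}\, \vartheta^{(r)}_{v,k}} \right] \qquad (8) \]

Box I.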
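To make the multinomial conditional of Eqs. (7) and (8) concrete, the following sketch computes the outcome probabilities for a single ordered pair (u, v) and splits an observed link count among the (k, h, h') combinations. The function names and the toy strengths are hypothetical and serve only as an illustration of the idea; they are not the authors' implementation.

```python
import numpy as np


def outcome_probabilities(theta_s_u, phi_s_u, phi_r_v, theta_r_v):
    """Eq. (8) for one pair (u, v): a (K, H, H) array of probabilities that sums to 1."""
    # Entry [k, h, g] is theta^(s)_{u,k} * phi^(s)_{u,h} * phi^(r)_{v,g} * theta^(r)_{v,k},
    # where index g plays the role of h'.
    weights = np.einsum("k,h,g,k->khg", theta_s_u, phi_s_u, phi_r_v, theta_r_v)
    return weights / weights.sum()


def split_link_count(link_count, probs, rng):
    """Eq. (7): distribute L_{u->v} over the auxiliary variables z^(k,h,h')_{u,v}."""
    draws = rng.multinomial(link_count, probs.ravel())
    return draws.reshape(probs.shape)


rng = np.random.default_rng(1)
K, H = 3, 2  # toy sizes
probs = outcome_probabilities(rng.gamma(1.0, 1.0, K), rng.gamma(1.0, 1.0, H),
                              rng.gamma(1.0, 1.0, H), rng.gamma(1.0, 1.0, K))
z_uv = split_link_count(4, probs, rng)
```

By construction, the entries of `z_uv` add up to the link count passed in, which mirrors the statement that the auxiliary vector preserves the marginal distribution of the link variable.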

$\boldsymbol{V} \subset \boldsymbol{L}$ falls below $10^{-6}$. Remarkably, variational posterior inference in TORTILLA is expedited on sparse networks, because of the useful properties of the Poisson distribution. In particular, the sums over nodes in Algorithm 1 involve considering only the observed links (the interested reader is referred to Gopalan et al., 2015 for a detailed explanation).


Fig. 5. Enron statistics.

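\[ \pi_{u,k} = [\pi^{(shp)}_{u,k}, \pi^{(rate)}_{u,k}] = \left[ \left( \sum_{v \in \boldsymbol{N}, R_h, R_{h'} \in \boldsymbol{R}} L_{u \to v}\, \gamma^{(k,h,h')}_{u \to v} + \alpha \right), \left( \sum_{v \in \boldsymbol{N}, R_h, R_{h'} \in \boldsymbol{R}} \frac{\lambda^{(shp)}_{u,h}\, \eta^{(shp)}_{v,h'}\, \xi^{(shp)}_{v,k}}{\lambda^{(rate)}_{u,h}\, \eta^{(rate)}_{v,h'}\, \xi^{(rate)}_{v,k}} + \frac{\alpha}{\beta} \right) \right] \qquad (9) \]

\[ \xi_{v,k} = [\xi^{(shp)}_{v,k}, \xi^{(rate)}_{v,k}] = \left[ \left( \sum_{u \in \boldsymbol{N}, R_h, R_{h'} \in \boldsymbol{R}} L_{u \to v}\, \gamma^{(k,h,h')}_{u \to v} + \alpha \right), \left( \sum_{u \in \boldsymbol{N}, R_h, R_{h'} \in \boldsymbol{R}} \frac{\pi^{(shp)}_{u,k}\, \lambda^{(shp)}_{u,h}\, \eta^{(shp)}_{v,h'}}{\pi^{(rate)}_{u,k}\, \lambda^{(rate)}_{u,h}\, \eta^{(rate)}_{v,h'}} + \frac{\alpha}{\beta} \right) \right] \qquad (10) \]

\[ \lambda_{u,h} = [\lambda^{(shp)}_{u,h}, \lambda^{(rate)}_{u,h}] = \left[ \left( \sum_{v \in \boldsymbol{N}, C_k \in \boldsymbol{C}, R_{h'} \in \boldsymbol{R}} L_{u \to v}\, \gamma^{(k,h,h')}_{u \to v} + \delta \right), \left( \sum_{v \in \boldsymbol{N}, C_k \in \boldsymbol{C}, R_{h'} \in \boldsymbol{R}} \frac{\pi^{(shp)}_{u,k}\, \eta^{(shp)}_{v,h'}\, \xi^{(shp)}_{v,k}}{\pi^{(rate)}_{u,k}\, \eta^{(rate)}_{v,h'}\, \xi^{(rate)}_{v,k}} + \frac{\delta}{\epsilon} \right) \right] \qquad (11) \]

\[ \eta_{v,h'} = [\eta^{(shp)}_{v,h'}, \eta^{(rate)}_{v,h'}] = \left[ \left( \sum_{u \in \boldsymbol{N}, C_k \in \boldsymbol{C}, R_h \in \boldsymbol{R}} L_{u \to v}\, \gamma^{(k,h,h')}_{u \to v} + \delta \right), \left( \sum_{u \in \boldsymbol{N}, C_k \in \boldsymbol{C}, R_h \in \boldsymbol{R}} \frac{\pi^{(shp)}_{u,k}\, \lambda^{(shp)}_{u,h}\, \xi^{(shp)}_{v,k}}{\pi^{(rate)}_{u,k}\, \lambda^{(rate)}_{u,h}\, \xi^{(rate)}_{v,k}} + \frac{\delta}{\epsilon} \right) \right] \qquad (12) \]

\[ \gamma^{(k,h,h')}_{u,v} \propto \exp\left\{ \Psi\big[\pi^{(shp)}_{u,k}\big] - \log \pi^{(rate)}_{u,k} + \Psi\big[\lambda^{(shp)}_{u,h}\big] - \log \lambda^{(rate)}_{u,h} + \Psi\big[\eta^{(shp)}_{v,h'}\big] - \log \eta^{(rate)}_{v,h'} + \Psi\big[\xi^{(shp)}_{v,k}\big] - \log \xi^{(rate)}_{v,k} \right\} \qquad (13) \]

Box II.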
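The multinomial update of Eq. (13) can be sketched as follows for one ordered pair (u, v). Each term of the form digamma(shape) minus log(rate) is the expected logarithm of a Gamma-distributed variable, which is a standard identity. The helper `digamma` below is a simple numerical approximation written for this example (a recurrence followed by an asymptotic series), and all function and argument names are assumptions of this sketch rather than the authors' code.

```python
import math
import numpy as np


def digamma(x):
    """Approximate digamma for x > 0: shift x upward, then apply an asymptotic series."""
    shift = 0.0
    while x < 6.0:
        shift -= 1.0 / x  # recurrence: psi(x) = psi(x + 1) - 1 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return shift + math.log(x) - 0.5 / x - series


def expected_log(shape, rate):
    """E[log X] for X ~ Gamma(shape, rate), elementwise on 1-D arrays."""
    return np.array([digamma(s) for s in shape]) - np.log(rate)


def update_gamma(pi_u, xi_v, lam_u, eta_v):
    """Eq. (13) for one pair (u, v); each argument is a (shape, rate) pair of 1-D arrays.

    pi_u and xi_v have length K, lam_u and eta_v have length H.
    Returns a normalized (K, H, H) array indexed by (k, h, h').
    """
    kappa = expected_log(*pi_u)   # community term of the sender u
    tau = expected_log(*xi_v)     # community term of the receiver v
    nu = expected_log(*lam_u)     # role term of the sender u
    upsilon = expected_log(*eta_v)  # role term of the receiver v
    log_gamma = (kappa + tau)[:, None, None] + nu[None, :, None] + upsilon[None, None, :]
    log_gamma -= log_gamma.max()  # subtract the maximum for numerical stability
    gamma = np.exp(log_gamma)
    return gamma / gamma.sum()


K, H = 3, 2  # toy sizes
flat_K = (np.full(K, 2.0), np.full(K, 1.0))
flat_H = (np.full(H, 2.0), np.full(H, 1.0))
gamma_uv = update_gamma(flat_K, flat_K, flat_H, flat_H)
```

With identical variational parameters for every community and role, as in the toy call above, the update assigns the same probability to each of the K·H·H combinations.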

6. Exploratory and predictive network analysis

Algorithm 1 fits TORTILLA to a network with $K$ latent communities and $H$ unknown roles. The resulting posterior $q(\Theta, \Phi, \boldsymbol{Z} \mid \boldsymbol{\mu})$ enables both tasks of Section 3.4, as detailed beneath.

6.1. Exploratory network analysis

The goal of this task is to unveil the $K$ latent communities and $H$ unknown roles of the network. Both are revealed in terms of affiliation strengths. The strength of the affiliation of a generic node $u \in \boldsymbol{N}$ to a given community $C_k$ (with $k = 1, \ldots, K$) and a certain role $R_h$ (with $h = 1, \ldots, H$) is estimated via the following expectations

\[ \bar{\vartheta}^{(s)}_{u,k} = E[\vartheta^{(s)}_{u,k}] \qquad \bar{\vartheta}^{(r)}_{v,k} = E[\vartheta^{(r)}_{v,k}] \]

\[ \bar{\varphi}^{(s)}_{u,h} = E[\varphi^{(s)}_{u,h}] \qquad \bar{\varphi}^{(r)}_{v,h'} = E[\varphi^{(r)}_{v,h'}] \]

In practice, we explicitly allow $u$ to be not necessarily affiliated to $C_k$. This is accomplished through an intuitive reasoning. More precisely, $u$ is considered as actually affiliated to $C_k$ as long as $\bar{\vartheta}^{(\cdot)}_{u,k}$ is sufficiently large, i.e., if $\bar{\vartheta}^{(\cdot)}_{u,k} > \zeta$, with $\zeta$ being a suitable threshold.


Fig. 6. Neural Network statistics.

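Otherwise, if the estimated strength $\bar{\vartheta}^{(\cdot)}_{u,k} \leq \zeta$, $u$ is considered as not affiliated to $C_k$. Assume that $\varphi^{(s)}_{h} \triangleq \min_u \bar{\varphi}^{(s)}_{u,h}$ and $\varphi^{(r)}_{h'} \triangleq \min_v \bar{\varphi}^{(r)}_{v,h'}$. By elaborating on the notion of background link probability (Yang et al., 2014), the definition of the threshold $\zeta$ is

\[ \zeta = \sqrt{ -\frac{1}{\sum_{h,h'=1}^{H} \varphi^{(s)}_{h}\, \varphi^{(r)}_{h'}} \ln\left( 1 - \frac{1}{|\boldsymbol{V}|} \right) } \]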
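A possible way to apply the thresholding rule is sketched below. The code reads the threshold as the square root of minus ln(1 - 1/|V|) divided by the double sum of the products of the minimum role strengths, and it takes the posterior mean of a Gamma factor as its shape-to-rate ratio. The function names, the interpretation of |V| as a node count and the toy numbers are assumptions of this example.

```python
import math


def posterior_mean(shape, rate):
    """Mean of a Gamma(shape, rate) variational factor: the shape-to-rate ratio."""
    return shape / rate


def affiliation_threshold(phi_s_min, phi_r_min, v_size):
    """Threshold zeta from the minimum sender/receiver role strengths and |V|.

    phi_s_min[h] is the minimum over nodes of the estimated sender strength for role h;
    phi_r_min[h] is the analogous minimum for receiver strengths.
    v_size is the cardinality |V| in the formula (assumed here to be the number of nodes).
    """
    role_mass = sum(a * b for a in phi_s_min for b in phi_r_min)
    return math.sqrt(-math.log(1.0 - 1.0 / v_size) / role_mass)


def affiliated_communities(strengths, zeta):
    """Indices k of the communities whose estimated strength exceeds the threshold."""
    return [k for k, s in enumerate(strengths) if s > zeta]


# Toy values, hypothetical: two roles with unit minimum strengths and 100 nodes.
zeta = affiliation_threshold([1.0, 1.0], [1.0, 1.0], v_size=100)
estimated = [posterior_mean(2.0, 1.0), posterior_mean(0.01, 1.0), posterior_mean(0.5, 2.0)]
members = affiliated_communities(estimated, zeta)
```

In this toy setting the node is retained in the first and third communities, while the second affiliation falls below the threshold and is discarded.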

6.2. Link prediction

The estimation of the latent affiliation strengths allows for the calculation of the Poisson rates that rule link establishment. Therefore, under TORTILLA, missing links $u \to v \notin \boldsymbol{E}$ (Liben-Nowell and Kleinberg, 2007) are predicted by ranking such links through a score $s_{u \to v}$, that is computed through the below expectation

\[ s_{u \to v} = E[\lambda_{u,v}] \]

where $\lambda_{u,v}$ is the rate of Eq. (1).

7. Community- and role-affiliation modeling under QUESADILLA

TORTILLA suffers from the following three limitations.
• Firstly, affiliations to communities and roles are combined through Eq. (1) into symmetric interactions. Such a symmetry may prevent the former from being inferred as neatly distinct from the latter.
• Secondly, the affiliations of roles to communities are not explicitly modeled. This prevents gaining an understanding of the specificity of roles to communities.
• Thirdly, the strength of interaction between roles is ignored. As a consequence, the contribution of node roles to link establishment is not fully captured.

In order to overcome the above limitations, we introduce a new model called QUESADILLA (QUantified nodE interactionS for grAph moDeling based on communIty and roLe affiLiAtions). The graphical representation of QUESADILLA and its generative process are illustrated in Figs. 3 and 4, respectively. Essentially, QUESADILLA is an extension of TORTILLA, in which network topology is conceived as the result of two types of node interactions. These can be explained as the interactions between nodes inside communities according to Yang and Leskovec (2014) and the interactions between the roles of nodes inside communities according to mixed-membership stochastic blockmodeling¹ (Airoldi et al., 2008). As with TORTILLA, the establishment of a link from a node $u$ to another node $v$ is still governed through a Poisson distribution. However, under QUESADILLA, the parameterizing rate $\lambda_{u,v}$ is redefined through the below Eq. (14), with the aim to combine the two foresaid types of interactions.

\[ \lambda_{u,v} = \sum_{k=1}^{K} \sum_{h=1,h'=1}^{H} \vartheta^{(s)}_{u,k}\, \varphi^{(s)}_{u,h}\, \varsigma_{h,k}\, \vartheta^{(r)}_{v,k}\, \varphi^{(r)}_{v,h'}\, \varsigma_{h',k}\, \omega^{(s,r)}_{h,h'} \qquad (14) \]

In the above Eq. (14), $\varsigma_{h,k}$ (with $h = 1, \ldots, H$ and $k = 1, \ldots, K$) is the degree to which the generic role $R_h$ is affiliated to the arbitrary community $C_k$. Besides, $\omega^{(s,r)}_{h,h'}$ (with $h, h' = 1, \ldots, H$) is the strength with which a sender (node playing) role $R_h$ interacts with a receiver (node playing) role $R_{h'}$, according to mixed-membership stochastic blockmodeling. Remarkably, $\omega^{(s,r)}_{h,h'}$ allows for a more realistic and accurate incorporation of the actual contribution from node roles to link establishment. Also, $\varsigma_{h,k}$ enforces a hierarchical relationship between communities and roles, that explicitly reveals the inherently characteristic roles of communities. Moreover, due to $\varsigma_{h,k}$ and $\omega^{(s,r)}_{h,h'}$, node affiliations are combined through Eq. (14) into asymmetric interactions. This ensures that communities and roles are inferred from distinct groups of respective node affiliations. Yet, $\varsigma_{h,k}$ and $\omega^{(s,r)}_{h,h'}$ are suitably accommodated into QUESADILLA, so that the requirements (1), (2) and (3) of Section 4 are still met. The derivation of variational inference under QUESADILLA, along with the implementation of the mathematical updates into a respective variational algorithm, is omitted, being similar to the developments in Section 5.

¹ Stochastic blockmodeling is a powerful generalization of structural equivalence (Lorrain and White, 1971), a traditional approach to role analysis (Ross and Ahmed, 2015).

Fig. 7. Twitter statistics.

8. Experimental evaluation

TORTILLA and QUESADILLA are investigated both quantitatively and qualitatively. The quantitative evaluation in Section 8.2 aims to comparatively assess the performance of TORTILLA and QUESADILLA in community discovery (through compactness), link prediction (via AUC and ROC analysis) and scalability (by means of runtime analysis with network size). The qualitative evaluation in Section 8.3 elucidates the output of both approaches on real-world networks.

8.1. Datasets

Five real-world directed networks from the social, biological and information domains are selected to comparatively experiment with TORTILLA and QUESADILLA. The selected datasets are Enron, Neural Network, Twitter, $Twitter_{20\%}$ and Google+, with Twitter as well as Google+ being popular (very-)large-scale networks.

In particular, the Enron repository² consists of emails authored by 158 employees in the Enron Corporation. The cleaned dataset includes nearly 250,000 email messages, that involve 150 employees, although the actual number of different employees amounts to 148, since 2 employees have two usernames each. The test-bed for TORTILLA and QUESADILLA is a communication network, that is implicitly formed by 18,233 emails, exchanged among the foresaid 148 employees.

Neural Network (Watts and Strogatz, 1998) is a directed network representing the structure and connectivity of the nervous system of a soil worm, referred to as Nematode Caenorhabditis, that involves 306 neurons with 2345 connections (White et al., 1986). TORTILLA and QUESADILLA are tested on the largest connected component of Neural Network, which includes 297 neurons.

Twitter³ is crawled in McAuley and Leskovec (2012) from 973 ego-networks. The data set encompasses 4869 circles along with 81,306 nodes and 1,768,149 directed links. We also consider $Twitter_{20\%}$, that is formed by randomly sampling 20% of all links inside the whole Twitter network.

Google+⁴ is assembled in McAuley and Leskovec (2012) from 133 ego-networks with 479 circles and includes 107,614 nodes along with 13,673,453 directed edges. The in-degree and out-degree distributions of the selected networks are illustrated in Figs. 5 to 9.

² http://www.cs.cmu.edu/~enron/.
³ http://snap.stanford.edu/data/ego-Twitter.html.
⁴ http://snap.stanford.edu/data/egonets-Gplus.html.

Fig. 8. $Twitter_{20\%}$ statistics.

8.2. Quantitative evaluation

An in-depth empirical analysis is carried out below to contrast TORTILLA and QUESADILLA against various competitors, in terms of compactness (Zhang et al., 2007) of the uncovered communities, link prediction as well as scalability.

8.2.1. Competitors

The chosen competitors include state-of-the-art Bayesian generative approaches to the unified analysis of overlapping communities and roles, i.e., BH-CRM (Costa and Ortale, 2012), BLFHM (Costa and Ortale, 2014) and BH-CRM$_{LP}$ (Costa and Ortale, 2013). Besides, LDA-G (Henderson and Rad, 2009) is considered as a further competing approach to role-unaware community discovery. The comparison of TORTILLA and QUESADILLA against LDA-G is useful to substantiate whether awareness of node roles actually enhances model performance.

8.2.2. Experimental setup

In order to train TORTILLA and QUESADILLA, the selected networks are preliminarily divided into corresponding training, validation as well as test sets through random sampling. Specifically, for each selected network, 70% of all links are included into the training set, 15% into the test set and the residual 15% into a held-out validation set. Test and held-out validation sets also share the same number of missing links.

On the selected networks, a same setting for $K$ and $H$ is adopted across TORTILLA, QUESADILLA, BH-CRM, BLFHM and BH-CRM$_{LP}$. In particular, $K = 8$ is the number of well-linked and topically meaningful communities within Enron according to Pathak et al. (2008). Besides, $H = 4$ is the number of roles in Creamer et al. (2009), where four groups of such roles are pointed out, i.e., Employee (E), Trader (T), Middle Manager (MM) and Senior Manager (SM). Regarding Neural Network, $K = 5$ is the number of anatomical clusters, that match experimentally-identified functional circuits (Sohn et al., 2011). Moreover, $H = 4$ indicates the different function types of neurons, i.e., combination, interneuron, motor and sensory (Chatterjee and Sinha, 2008). As far as Twitter as well as $Twitter_{20\%}$ are regarded, $K = 4{,}869$ is the total number of constituting circles. A suitable value for $H$ is searched for in the interval [2, 8]. $H = 4$ is empirically chosen as the value that enables the maximum gain in community discovery and link prediction. Lastly, on Google+, $K = 479$ is, again, the total number of constituting circles. $H = 4$ is the best empirical setting in the interval [2, 8]. On each of the selected networks, LDA-G finds the same number of communities as the one set for all other competitors.

8.2.3. Community compactness

The performance of all competitors in community discovery is evaluated in terms of compactness. The latter is the average shortest distance among the nodes with strongest affiliation to a community (Zhang et al., 2007). Essentially, the adoption of compactness can be explained as a network-centric reinterpretation of a well-known clustering criterion, i.e., intra-cluster distance, for the purpose of assessing the cohesiveness of the discovered communities.

Formally, let $\boldsymbol{C} = \{C_1, \ldots, C_K\}$ be the set of $K$ uncovered communities. Also, assume that $n^{(k)}_1, \ldots, n^{(k)}_q$ are the top-$q$ members of the generic community $C_k \in \boldsymbol{C}$, ranked by the strength of their affiliation to $C_k$. Compactness for $C_k$ is defined as

\[ \operatorname{compactness}(C_k) = \frac{2}{q(q-1)} \sum_{f=1}^{q} \sum_{t>f}^{q} d_{n^{(k)}_f, n^{(k)}_t} \]

where $d_{\cdot,\cdot}$ denotes the shortest distance between any two nodes. In turn, compactness for all discovered communities is calculated by averaging compactness for the individual communities, i.e., $\operatorname{compactness}(\boldsymbol{C}) = \frac{1}{K} \sum_{k=1}^{K} \operatorname{compactness}(C_k)$. The lesser $\operatorname{compactness}(\boldsymbol{C})$, the better the performance in community discovery.
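The compactness measure can be computed with plain breadth-first search, as in the sketch below. It is a minimal illustration under stated assumptions: distances are hop counts on an adjacency-list graph treated as undirected, the top members of each community are mutually reachable, and the function names are introduced here only for the example.

```python
from collections import deque
from itertools import combinations


def bfs_distances(adj, source):
    """Hop distances from source to every reachable node of an adjacency-list graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def community_compactness(adj, top_members):
    """Average shortest distance over all unordered pairs of the top-q members."""
    q = len(top_members)
    # One BFS per member; assumes every pair of top members is connected.
    dist_from = {member: bfs_distances(adj, member) for member in top_members}
    total = sum(dist_from[f][t] for f, t in combinations(top_members, 2))
    return 2.0 * total / (q * (q - 1))


def overall_compactness(adj, communities):
    """Mean of the per-community compactness values."""
    return sum(community_compactness(adj, c) for c in communities) / len(communities)


# Toy path graph 0 - 1 - 2 - 3, with two hypothetical communities.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
c1 = community_compactness(path, [0, 1, 2])  # pairwise distances 1, 2 and 1
c2 = community_compactness(path, [2, 3])     # a single pair at distance 1
avg = overall_compactness(path, [[0, 1, 2], [2, 3]])
```

On the toy path graph, the first community has compactness 4/3 and the second has compactness 1, so the tighter group of nodes receives the lower (better) value.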


Fig. 9. Google+ statistics.

In all tests, for each uncovered community, we compute compactness from the top 15 members with strongest affiliation and, accordingly, set $q = 15$. Table 1 summarizes compactness results. QUESADILLA finds the most compact communities on Enron, Neural Network and $Twitter_{20\%}$. TORTILLA is the runner-up. In addition, only QUESADILLA and TORTILLA successfully scale to process Twitter and Google+, with QUESADILLA again outperforming TORTILLA. The – symbol reported in the entries of Tables 1 and 2 means that the performance of BH-CRM, BLFHM, BH-CRM$_{LP}$ and LDA-G on Twitter and Google+ is not available, since we were not able to test these competitors on Twitter and Google+ within a reasonable computational time. This motivates the additional comparison on $Twitter_{20\%}$, in addition to highlighting the significant enhancement in scalability achieved by TORTILLA and QUESADILLA, that is explicitly investigated in Section 8.2.5.

8.2.4. Link prediction

The predictive power of TORTILLA and QUESADILLA is comparatively evaluated through link prediction. The rationale behind the choice of link prediction is that the goodness of network connectivity models can be assessed by the degree to which such models reliably predict the presence/absence of links between nodes (Henderson and Rad, 2009; Henderson et al., 2010). The tests for assessing the link prediction performance of the competing models are designed as in Henderson et al. (2010). Specifically, 10 experiments are conducted on the selected networks. The generic experiment is a two-stage process. At the first stage, a training set and a held-out test set are formed by dividing the input network. More in detail, the test set is grown by randomly sampling the input network, with the aim to choose a same number of absent and present links, whose sum corresponds to 15% of all links in the input network. At the second stage, the competing models are inferred from the training set and exploited for predicting the links of the test set. Link prediction under TORTILLA is covered in Section 6.2. By analogy, the reasoning reported in Section 6.2 also applies to QUESADILLA. Link prediction under BH-CRM, BLFHM, BH-CRM$_{LP}$ as well as LDA-G is detailed in Costa and Ortale (2013, 2014).

Figs. 10a, 10b, 10c show the best ROC curve delivered by each competitor among the ones observed across the above 10 experiments over Enron, Neural Network as well as $Twitter_{20\%}$, respectively. Additionally, Figs. 10d and 10e show the best ROC curve for TORTILLA and QUESADILLA on Twitter and Google+, respectively. On $Twitter_{20\%}$, the predictive performance of TORTILLA and QUESADILLA exceeds the predictive performance of all other competitors throughout the range of false positive rates. On Enron as well as Neural Network, TORTILLA and QUESADILLA outperform BLFHM, BH-CRM, BH-CRM$_{LP}$ and LDA-G throughout very wide subranges of the false positive rate. Nonetheless, the link prediction performance of TORTILLA and QUESADILLA is not dominant across the entire range of false positive rates. Thus, we investigate the Area under the ROC Curve (AUC), in order to evaluate the average predictive performances across the 10 tests on Enron as well as Neural Network. The AUC values regarding TORTILLA and QUESADILLA on $Twitter_{20\%}$, Twitter and Google+ are reported for completeness.

Table 2 summarizes the average AUC values for all competitors across the 10 tests on the selected networks. QUESADILLA outdoes all other competitors in link prediction. TORTILLA is again found to be the runner-up. Noticeably, QUESADILLA, TORTILLA, BLFHM and BH-CRM outperform LDA-G in link prediction. Moreover, as reported in Table 1, TORTILLA and QUESADILLA along with BLFHM (as well as BH-CRM, in spite of the exception on Neural Network) found more compact communities in comparison with LDA-G. Such empirical evidence corroborates the rationality of jointly considering communities and roles in link establishment, to more accurately capture and predict network connectivity. From this perspective, BH-CRM$_{LP}$ is essentially competitive in link prediction with LDA-G, despite its link-partitioning generative semantics. Furthermore, the superiority of QUESADILLA and TORTILLA in community discovery and link prediction in comparison with BLFHM substantiates the benefit due to the incorporation of the latent weighted generalization of directed affiliation modeling discussed in Section 3.3. Yet, the superiority of QUESADILLA with respect to TORTILLA, in community discovery as well as link prediction, corroborates that the former more accurately captures node interaction in link establishment compared to the latter.

Fig. 10. ROC analysis.

8.2.5. Scalability

Lastly, we investigate the scalability of TORTILLA and QUESADILLA with the size of the underlying network. Accordingly, we evaluate runtime over randomly drawn samples of Twitter and Google+ with a growingly larger number of retained links. Fig. 11 shows the


Fig. 11. Scalability of TORTILLA and QUESADILLA with network size.

ALGORITHM 1: The Coordinate-Ascent Algorithm
Input: An adjacency matrix $\boldsymbol{L}$ of an observed network; the number $K$ of latent communities; the number $H$ of underlying behavioral roles;
Output: the strengths $\Theta$ and $\Phi$ of node affiliations to the $K$ communities and $H$ roles
1  set all variational parameters $\pi_{u,k}$, $\xi_{v,k}$, $\lambda_{u,h}$, $\eta_{v,h'}$ equal (except for a random offset) to the prior on the corresponding latent variables (Gopalan et al., 2015);
2  repeat
3    for each $u \to v$ such that $L_{u \to v} = 1$ do
4      for each $k = 1, \ldots, K$ do
5        $\kappa_{u,k} = \Psi[\pi^{(shp)}_{u,k}] - \log \pi^{(rate)}_{u,k}$;
6        $\tau_{v,k} = \Psi[\xi^{(shp)}_{v,k}] - \log \xi^{(rate)}_{v,k}$;
7        for each $h, h' = 1, \ldots, H$ do
8          $\nu_{u,h} = \Psi[\lambda^{(shp)}_{u,h}] - \log \lambda^{(rate)}_{u,h}$;
9          $\upsilon_{v,h'} = \Psi[\eta^{(shp)}_{v,h'}] - \log \eta^{(rate)}_{v,h'}$;
10         $\gamma^{(k,h,h')}_{u,v} \propto e^{\kappa_{u,k} + \nu_{u,h} + \upsilon_{v,h'} + \tau_{v,k}}$;
11       end
12     end
13   end
14   for each $u \in \boldsymbol{N}$ do
15     for each $k = 1, \ldots, K$ do
16       $\pi^{(shp)}_{u,k} = \left[ \sum_{v \in \boldsymbol{N}, R_h, R_{h'} \in \boldsymbol{R}} L_{u \to v}\, \gamma^{(k,h,h')}_{u \to v} \right] + \alpha$;
17       $\pi^{(rate)}_{u,k} = \left[ \sum_{v \in \boldsymbol{N}, R_h, R_{h'} \in \boldsymbol{R}} \frac{\lambda^{(shp)}_{u,h}\, \eta^{(shp)}_{v,h'}\, \xi^{(shp)}_{v,k}}{\lambda^{(rate)}_{u,h}\, \eta^{(rate)}_{v,h'}\, \xi^{(rate)}_{v,k}} \right] + \frac{\alpha}{\beta}$;
18       $\xi^{(shp)}_{u,k} = \left[ \sum_{v \in \boldsymbol{N}, R_h, R_{h'} \in \boldsymbol{R}} L_{v \to u}\, \gamma^{(k,h',h)}_{v \to u} \right] + \alpha$;
19       $\xi^{(rate)}_{u,k} = \left[ \sum_{v \in \boldsymbol{N}, R_h, R_{h'} \in \boldsymbol{R}} \frac{\pi^{(shp)}_{v,k}\, \lambda^{(shp)}_{v,h}\, \eta^{(shp)}_{u,h'}}{\pi^{(rate)}_{v,k}\, \lambda^{(rate)}_{v,h}\, \eta^{(rate)}_{u,h'}} \right] + \frac{\alpha}{\beta}$;
20     end
21     for each $h = 1, \ldots, H$ do
22       $\lambda^{(shp)}_{u,h} = \left[ \sum_{v \in \boldsymbol{N}, C_k \in \boldsymbol{C}, R_{h'} \in \boldsymbol{R}} L_{u \to v}\, \gamma^{(k,h,h')}_{u \to v} \right] + \delta$;
23       $\lambda^{(rate)}_{u,h} = \left[ \sum_{v \in \boldsymbol{N}, C_k \in \boldsymbol{C}, R_{h'} \in \boldsymbol{R}} \frac{\pi^{(shp)}_{u,k}\, \eta^{(shp)}_{v,h'}\, \xi^{(shp)}_{v,k}}{\pi^{(rate)}_{u,k}\, \eta^{(rate)}_{v,h'}\, \xi^{(rate)}_{v,k}} \right] + \frac{\delta}{\epsilon}$;
24       $\eta^{(shp)}_{u,h} = \left[ \sum_{v \in \boldsymbol{N}, C_k \in \boldsymbol{C}, R_{h'} \in \boldsymbol{R}} L_{v \to u}\, \gamma^{(k,h',h)}_{v \to u} \right] + \delta$;
25       $\eta^{(rate)}_{u,h} = \left[ \sum_{v \in \boldsymbol{N}, C_k \in \boldsymbol{C}, R_{h'} \in \boldsymbol{R}} \frac{\pi^{(shp)}_{v,k}\, \lambda^{(shp)}_{v,h'}\, \xi^{(shp)}_{u,k}}{\pi^{(rate)}_{v,k}\, \lambda^{(rate)}_{v,h'}\, \xi^{(rate)}_{u,k}} \right] + \frac{\delta}{\epsilon}$;
26     end
27   end
28 until convergence;

Fig. 12. Role distributions across the 8 Enron communities of Table 3.

Table 1
Compactness results.

Network          QUESADILLA  TORTILLA  BLFHM  BH-CRM  BH-CRM_LP  LDA-G
Enron            2.64        2.71      2.78   3.06    3.30       3.69
Neural Network   3.28        3.33      3.45   4.13    4.23       3.65
Twitter20%       5.43        5.64      5.98   6.04    6.42       6.21
Twitter          4.50        4.79      –      –       –          –
Google+          3.75        3.89      –      –       –          –

Table 2
Average AUC results.

Network          QUESADILLA  TORTILLA  BLFHM  BH-CRM  BH-CRM_LP  LDA-G
Enron            87.28       85.58     82.16  80.40   74.81      75.70
Neural Network   84.95       83.87     78.27  76.70   67.43      66.52
Twitter20%       83.81       82.73     74.28  71.61   69.05      67.24
Twitter          86.36       85.11     –      –       –          –
Google+          79.86       78.18     –      –       –          –

trend of time efficiency of TORTILLA and QUESADILLA on small-, medium- and large-scale networks. The substantially linear scalability of TORTILLA and QUESADILLA with network size highlights the benefit of the Poisson distribution and variational inference under both models.

8.3. Qualitative evaluation

The output of QUESADILLA on real-world networks is elucidated through the inspection of the results observed on Enron when $K = 8$ and $H = 4$ (such a setting is motivated in Section 8.2). In particular, Table 3 summarizes the 8 discovered overlapping communities through their respective top-5 nodes with strongest affiliation. Fig. 12 illustrates the occurrence frequency of roles within the 8 communities of Table 3.

9. Conclusions

We proposed two innovative approaches to the joint modeling and simultaneous detection of overlapping communities and roles in networks. Both approaches consist in performing posterior inference under as many innovative network models, in which topology is the result of realistic and intuitive Bayesian probabilistic processes. In order to associate nodes with communities and roles, the devised models


Table 3
A summary of the Enron overlapping communities uncovered by QUESADILLA.

Rank  Community 1   Community 2  Community 3   Community 4
1     Dean          Presto       Sanders       Smith
2     Kean          Steffes      Platter       Beck
3     Hernandez     Kean         Kean          Steffes
4     Williams      Hyatt        Ring          Rodrique
5     Gilbertsmith  Rodrique     Stepenovitch  Badeer

Rank  Community 5   Community 6  Community 7   Community 8
1     Fischer       Kuykendall   Williams
2     Badeer        Hyatt        Lewis
3     Watson        Kean         Blair
4     Mccarty       Quigley      Germany
5     Scott         Sager        Rodrique

Costa, G., Ortale, R., 2014. A unified generative Bayesian model for community discovery and role assignment based upon latent interaction factors. In: Proc. of the IEEE/ACM Int. Conf. on Advances in Social Networks Analysis and Mining, pp. 93–100.
Costa, G., Ortale, R., 2016a. A mean-field variational Bayesian approach to detecting overlapping communities with inner roles using Poisson link generation. In: Proc. of International Symposium on Intelligent Data Analysis, pp. 110–122.
Costa, G., Ortale, R., 2016b. Model-based collaborative personalized recommendation on signed social rating networks. ACM Trans. Internet Technol. 16 (3), 20:1–20:21.
Costa, G., Ortale, R., 2016c. Scalable detection of overlapping communities and role assignments in networks via Bayesian probabilistic generative affiliation modeling. In: Proc. of International OTM Conference on Cooperative Information Systems, pp. 99–117.
Costa, G., Ortale, R., 2018. Mining overlapping communities and inner role assignments through Bayesian mixed-membership models of networks with context-dependent interactions. ACM Trans. Knowl. Discov. Data 12 (2), 18:1–18:32.
Creamer, G., Rowe, R., Hershkop, S., Stolfo, S., 2009. In: Zhang, H., Spiliopoulou, M., Mobasher, B., Giles, C.L., Mccallum, A., Nasraoui, O., Srivastava, J., Yen, J. (Eds.), Advances in Web Mining and Web Usage Analysis. Springer-Verlag, pp. 40–58.
Evans, T., 2010. Clique graphs and overlapping communities. J. Stat. Mech. P12037.
Evans, T., Lambiotte, R., 2009. Line graphs, line partitions and overlapping communities. Phys. Rev. E 80, 016105.
Evans, T., Lambiotte, R., 2010. Line graphs of weighted networks for overlapping communities. Eur. Phys. J. B 77 (2), 265–272.
Fortunato, S., 2010. Community detection in graphs. Phys. Rep. 486 (3–5), 75–174.
Fortunato, S., Hric, D., 2016. Community detection in networks: A user guide. Phys. Rep. 659, 1–44.
Girvan, M., Newman, M., 2002. Community structure in social and biological networks. In: Proc. of the National Academy of Sciences, vol. 99, pp. 7821–7826, No. 12.
Gopalan, P., Blei, D., 2013. Efficient discovery of overlapping communities in massive networks. Proc. Natl. Acad. Sci. 110 (36), 14534–14539.
Gopalan, P., Hofman, J., Blei, D., 2015. Scalable recommendation with hierarchical Poisson factorization. In: Proc. of Uncertainty in Artificial Intelligence, pp. 326–335.
Gopalan, P., Wang, C., Blei, D., 2013. Modeling overlapping communities with node popularities. In: Proc. of Advances in Neural Information Processing Systems, pp. 2850–2858.
Henderson, K., Eliassi-Rad, T., Papadimitriou, S., Faloutsos, C., 2010. HCDF: A hybrid community discovery framework. In: Proc. of SIAM Conference on Data Mining, pp. 754–765.
Henderson, K., Rad, T.E., 2009. Applying latent Dirichlet allocation to group discovery in large graphs. In: ACM SAC. pp. 1456–1461.
Kernighan, B., Lin, S., 1970. An efficient heuristic procedure for partitioning graphs. Bell Syst. Tech. J. 49 (1), 291–307.
Kim, Y., Jeong, H., 2011. Map equation for link community. Phys. Rev. E 84, 026110.
Kolaczyk, E., 2009. Statistical Analysis of Network Data. Springer.
Liben-Nowell, D., Kleinberg, J., 2007. The link-prediction problem for social networks. J. Am. Soc. Inf. Sci. Technol. 58 (7), 1019–1031.
Lorrain, F., White, H., 1971. The structural equivalence of individuals in social networks. J. Math. Sociol. 1, 49–80.
Ma, Z., Sun, A., Yuan, Q., Cong, G., 2015. A tri-role topic model for domain-specific question answering. In: Proc. of AAAI Conf. on Artificial Intelligence, pp. 224–230.
McAuley, J., Leskovec, J., 2012. Learning to discover social circles in ego networks. In: Proc. of Advances in Neural Information Processing Systems, pp. 548–556.
McCallum, A., Wang, X., Corrada-Emmanuel, A., 2007. Topic and role discovery in social networks with experiments on Enron and academic email. J. Artificial Intelligence Res. 30 (1), 249–272.
Newman, M., 2004a. Detecting community structure in networks. Eur. Phys.
J. B 38 (2), 321–330. Newman, M., 2004b. Fast algorithm for detecting community structure in networks. Phys. Rev. E 69, 066133. Newman, M., Girvan, M., 2004. Finding and evaluating community structure in networks. Phys. Rev. E 69 (2), 026113. Pathak, N., Delong, C., Banerjee, A., Erickson, K., 2008. Social topic models for community extraction. In: Proc. of KDD Workshop on Social Network Mining and Analysis. Pothen, A., Simon, H., Liou, K.-P., 1990. Partitioning sparse matrices with eigenvectors of graphs. SIAM J. Matrix Anal. Appl. 11 (3), 430–452. Radicchi, F., Castellano, C., Cecconi, F., Loreto, V., Parisi, D., 2004. Defining and identifying communities in networks. In: Proc. of the National Academy of Sciences of the United States of America, vol. 101, pp. 2658–2663, No. 9. Ross, R., Ahmed, N., 2015. Role discovery in networks. IEEE Trans. Knowl. Data Eng. 27 (04), 1112–1131. Scripps, J., Tan, P.-N., Esfahanian, A.-H., 2007. Exploration of link structure and community-based node roles in network analysis. In: Proc. of Int. Conf. on Data Mining, pp. 649–654. Scripps, J., Tan, P.-N., Esfahanian, A.-H., 2007. Node roles and community structure in networks. In: Proc. of Workshop on Web Mining and Social Network Analysis (WebKDD and SNA-KDD), pp. 26–35. Sohn, Y., Choi, M.-K., Ahn, Y.-Y., Lee, J., Jeong, J., 2011. Topological cluster analysis reveals the systemic organization of the caenorhabditis elegans connectome. PLoS Comput. Biol. 7 (5), e1001139.

Linder Salisbury Rodrique Guzman Hyvl

rely on latent weighted directed affiliations from Gamma distributions. This allows for communities with heterogeneous connectivity structures, realistic overlaps and hierarchical nestings. Moreover, Poisson distributions are adopted in both models to govern link establishment according to the extent of interaction between nodes. In particular, under the second model, node interactions are also influenced by the strength of role-to-role interactions, that are in turn captured through mixed-membership stochastic blockmodeling. In the context of the developed models, the Poisson distributions also expedite inference on sparse networks. The mathematical details of mean-field variational inference were derived and implemented into a coordinateascent variational algorithm, for the exploratory and unsupervised analysis of node affiliations. A comparative empirical assessment on several real-world networks showed the superior performance of the devised approaches in community discovery, link prediction as well as scalability. Future research involves refining the affiliations of nodes, so as to also consider their attributes (Yang et al., 2013). It is also interesting to investigate the exploitation of community-specific user roles for the enhancement of social recommendation (Costa and Ortale, 2016b). References Ahn, Y., Bagrow, J., Lehmann, S., 2010. Link communities reveal multiscale complexity in networks. Nature 466, 761–764. Airoldi, E., Blei, D., Fienberg, S., Xing, E., 2008. Mixed membership stochastic blockmodels. J. Mach. Learn. Res. 9, 1981–2014. Arabie, P., Boorman, S., Levitt, P., 1978. Constructing blockmodels: How and why. J. Math. Psych. 17 (1), 21–63. Biddle, B., 1986. Recent development in role theory. Ann. Rev. Sociol. 12, 67–92. Bishop, C.M., 2006. Pattern Recognition and Machine Learning. Springer. Bishop, C., 2013. Model-based machine learning. Phil. Trans. R. Soc. A 371, 20120222. Bishop, Y., Fienberg, S., Holland, P., 2007. 
Discrete Multivariate Analysis: Theory and Practice. Springer. Blei, D., 2014. Build, compute, critique, repeat: Data analysis with latent variable models. Annu. Rev. Stat. Appl. 1, 203–232. Blei, D., Kucukelbir, A., McAuliffe, J., 2017. Variational inference: A review for statisticians. J. Amer. Statist. Assoc. 112 (518), 859–877. Blei, D., Lafferty, J., 2009. In: Srivastava, A., Sahami, M. (Eds.), Text Mining: Classification, Clustering, and Applications. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series, pp. 71–94. Blei, D.M., Ng, A.Y., Jordan, M.I., 2003. Latent dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022. Chatterjee, N., Sinha, S., 2008. Understanding the mind of a worm: Hierarchical network structure underlying nervous system function in c. elegans. In: Banerjee, R., Chakrabarti, B. (Eds.), Progress in Brain Research. Elsevier B.V., pp. 145–153. Chou, B.-H., Suzuki, E., 2010. Discovering community-oriented roles of nodes in a social network. In: Proc. of Int. Conf. on Data Warehousing and Knowledge Discovery, pp. 52–64. Costa, G., Ortale, R., 2012. A bayesian hierarchical approach for exploratory analysis of communities and roles in social networks. In: Proc. of the IEEE/ACM Int. Conf. on Advances in Social Networks Analysis and Mining, pp. 194–201. Costa, G., Ortale, R., 2013. Probabilistic analysis of communities and inner roles in networks: bayesian generative models and approximate inference. Soc. Netw. Anal. Min. 3 (4), 1015–1038. 15

G. Costa and R. Ortale

Engineering Applications of Artificial Intelligence 89 (2020) 103437 Xu, F., Ji, Z., Wang, B., 2012. Dual role model for question recommendation in community question answering. In: Proc. of Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 771–780. Yang, J., Leskovec, J., 2014. Structure and overlaps of ground-truth communities in networks. ACM Trans. Intell. Syst. Technol. (TIST) 5 (2), 26:1 – 26:35. Yang, J., McAuley, J., Leskovec, J., 2013. Community detection in networks with node attributes. In: Proc. of Int. Conf. on Data Mining, pp. 1151–1156. Yang, J., McAuley, J., Leskovec, J., 2014. Detecting cohesive and 2-mode communities in directed and undirected networks. In: Proc. of ACM Int. Conf. on Web Search and Data Mining, pp. 323–332. Yuan, S., Zhang, Y., Tang, J., Hall, W., Cabotà, J., 2019. Expert finding in community question answering: A review. Artif. Intell. Rev. 1–32. Zhang, H., Qiu, B., Giles, C., Foley, H., Yen, J., 2007. An lda-based community structure discovery approach for large-scale social networks. In: Proc. of IEEE Int. Conf. on Intelligence and Security Informatics, pp. 200–207. Zhao, W., Wang, J., He, Y., Nie, J.-Y., Wen, J.-R., Li, X., 2015. Incorporating social role theory into topic models for social media content analysis. IEEE Trans. Knowl. Data Eng. 27 (4), 1032–1044. Zhou, D., Manavoglu, E., Li, J., Giles, C., Zha, H., 2006. Probabilistic models for discovering e-communities. In: Proc. of Int. Conf. on World Wide Web, pp. 173–182.

Srba, I., Bielikova, M., 2016. A comprehensive survey and classification of approaches for community question answering. ACM Trans. Web 10 (3), 18:1–18:63. Steyvers, M., Griffiths, T., 2007. In: Landauer, T., McNamara, S.D., Kintsch, W. (Eds.), Latent Semantic Analysis: A Road to Meaning. Laurence Erlbaum, pp. 427–448. Wang, X., Huang, C., Yao, L., Benatallah, B., Dong, M., 2018. A survey on expert recommendation in community question answering. J. Comput. Sci. Tech. 33 (4), 625–653. Wasserman, S., Faust, K., 1994. Social Network Analysis: Methods and Applications. Cambridge University Press. Watts, D., Strogatz, S., 1998. Collective dynamics of ’small-world’ networks. Nature 393 (6684), 440–442. White, J., Southgate, E., Thompson, J., Brenner, S., 1986. The structure of the nervous system of the nematode caenorhabditis elegans. Philos. Trans. R. Soc. B 314 (1165), 1–340. Wu, Z., 2010. A fast and reasonable method for community detection with adjustable extent of overlapping. In: IEEE Int. Conf. on Intelligent Systems and Knowledge Engineering. pp. 376–379. Xie, J., Kelley, S., Szymanski, B., 2013. Overlapping community detection in networks: the state of the art and comparative study. ACM Comput. Surv. 45 (4).

16