Dependence clustering, a method revealing community structure with group dependence

Hyunwoo Park (Industrial and Systems Engineering and Tennenbaum Institute, Georgia Institute of Technology, Atlanta, USA)
Kichun Lee (Industrial Engineering, Hanyang University, Seoul, Republic of Korea; corresponding author, Tel.: +82 222200478)

Knowledge-Based Systems 60 (2014) 58-72. http://dx.doi.org/10.1016/j.knosys.2014.01.004

Article history: Received 1 May 2013; Received in revised form 30 December 2013; Accepted 6 January 2014; Available online 23 January 2014.

Keywords: Group dependence; Clustering; Markovian; Community structure; Mutual information

Abstract

We propose a clustering method maximizing a new measure called "group dependence." Group dependence quantifies how precise a certain division of a graph is in terms of dependence distance. Built upon a statistical dependence measure between points driven by Markovian transitions, group dependence incorporates the geometric structure of the input data. Besides capturing degrees of positive dependence and coherence for a group division, group dependence inherently supplies the proposed clustering method with a definite decision on the depth of division. We provide an optimality aspect of the method as theoretical justification in consideration of posterior transition probabilities of the input data. Illustrating its procedure using data from a known structure, we demonstrate its performance in the clustering task of real-world data sets, Amazon, DBLP, and YouTube, in comparison with selected clustering algorithms. We show that the proposed method outperforms the selected methods in reasonable settings: in particular, the proposed method surpasses modularity clustering in terms of normalized mutual information. We also show that the proposed method reveals additional insights on community structure detection according to its connectivity scale parameter.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

Identifying community structure in networks has been a central issue in many fields including sociology, bio-informatics, physics, and applied mathematics, to name only a few. Community structure detection is a branch of a broader class of problems: cluster analysis. Cluster analysis is the unsupervised task of assigning a set of objects to homogeneous groups. Depending on the number of groups in demand, clustering tasks can be divided into two kinds of problems. In the first case, the number of clusters is known when clustering is carried out. Graph partitioning is one line of research that fits this type of clustering. One of the most well-known examples of such tasks in computer science is assigning a number of inter-dependent tasks, represented as a graph, to multiple processors. Since the number of processors is likely fixed and known, a clustering algorithm that cannot take the predefined number of clusters into account is of little practical use in this context. Community structure detection, on the other hand, pursues a slightly different goal. In this setting, the number of communities is unknown beforehand. Not only grouping nodes precisely but also determining the

number of meaningful clusters latent in the graph structure is of great importance in this class of problems. Social network analysis falls into this category. Indeed, many clustering algorithms have been proposed in the data mining community. Among them are hierarchical clustering, k-means clustering, distribution-based clustering, and so forth. They are commonly based on the similarities or closeness among nodes. Han and Kamber [1] and Gan et al. [2] provide a thorough survey of the algorithms. Conceptually, in view of the number of clusters in demand, two approaches are possible for revealing the community structure in networks: agglomerative and divisive methods. Agglomerative methods start by grouping the nodes with the highest similarity and repeat the process with recalculated similarities among groups and nodes. The agglomerative approach is more intuitive than the divisive approach, so it was developed earlier and has been widely used. Agglomerative hierarchical clustering is a representative example of this approach when the true number of clusters is unknown. This approach needs to be followed by an additional critical step that involves a decision criterion for the optimal number of clusters [3]. k-means clustering groups nodes in a similar way with a predefined number of centroids. In contrast, divisive methods, which possess inherent rules for the optimal number of clusters, repeatedly cut the network until no subdivision of the network yields a gain.


One of the methods in this avenue is modularity-based clustering proposed by Newman [4]. Modularity measures how precise a division of the network is against a graph with edges placed at random, and it has played an essential role in detecting community structures in networks. One drawback of this method is that the random modularity measure employs a fixed global model, which assumes that each node can be linked to any other node of the network, whether the network is large or small, regardless of the geometric structure of the network. Thus, it cannot adjust the level of resolution, or the scale, on which the modularity measure relies. It would therefore be desirable to have a clustering method that not only includes an inherent rule for the optimal number of clusters, but also possesses flexibility in adjusting the scale at which it takes the geometric structure of the network into account.

In this paper, we propose a clustering method maximizing a new measure called "group dependence." Based on mutual information as well as posterior probabilities of network connections, group dependence provides flexibility in adjusting the scale on which the whole graph is viewed and the level of connectedness upon which a division of the network is evaluated in terms of dependence. Lee et al. [5] demonstrated the efficacy of the dependence concept and dependence distance in the context of dimensionality reduction. The statistical dependence measure between nodes on which the proposed clustering method relies is initially motivated by Markovian transitions among nodes and extends the concept of mutual information in a point-wise fashion. The dependence measure is also a lift measure between nodes, which is widely used to capture the level of association in association rule learning. A graph can be viewed as being connected via a Markov chain, which means that the neighborhood of a node evolves through Markovian transitions. The adjacency matrix of the graph and the transition steps in the neighborhood of each node determine the neighborhood structure of the graph and the scale on which the whole graph is viewed. The degree of relative dependence of a group division against a random division from the neighborhood structure, called group dependence, is assessed as a coherence measure for the division, naturally leading to a clear answer on the division depth and a group configuration with maximized group dependence. Furthermore, the level of connectedness for subdivision is adjustable in a straightforward manner. We describe the detailed machinery and performance of the algorithm in the following sections.

This paper proceeds as follows. In Section 2, we start by defining group dependence as an extension of dependence distance. We then explain the machinery of the proposed dependence clustering and illustrate its use by clustering a simple data set. Section 3 compares the performance of dependence clustering with that of popular clustering methods such as hierarchical, spectral, and modularity clustering. We first run the comparison on simulated data, and then use real-world data sets: the karate club faction data by Zachary [6], a tag co-occurrence network from Groupon, and large social and information network data from Amazon, DBLP (a computer science bibliography website, http://www.informatik.uni-trier.de/~ley/db/), and YouTube. We conclude with a discussion and future research issues in Section 4.

2. The dependence clustering

In this section, we first briefly summarize the concept of statistical dependence. We then introduce the concept of group dependence to measure coherence for a group division in terms of dependence, followed by a new clustering method based on group dependence. We also provide an optimality aspect of the proposed method and further details of the algorithm.

2.1. Dependence

Suppose we have n data points in R^b. Each data point x_1, ..., x_n represents a node in an undirected graph. Denote the set of nodes by X = {x_1, ..., x_n}. We view the graph as a Markov chain, assuming that the whole chain is ergodic and all transitions follow the Markovian property. We can then define the neighborhood of a node as the nodes that can be reached through Markovian transitions from the focal node. This neighborhood structure provides a foundation for a new distance measure between nodes in a graph. To illustrate the concept of neighborhood transitions, we provide Fig. 1, in which node 3 is closer to node 1 than node 2 is to node 1 in the Euclidean distance indicated by the arrows. However, in consideration of the edge structure representing Markovian transitions between the nodes, the graph geometry suggests that the distance between nodes 1 and 2 ought to be smaller than that between nodes 1 and 3 in terms of neighborhood-transition steps: node 2 is four transition steps away from node 1 while node 3 is thirteen.

Lee et al. [5] proposed the "dependence distance" between two nodes for a Markov chain in the t-step-wide neighborhood evolution in X, where t is an exogenously given parameter. They demonstrated its use in their dimensionality reduction algorithm, whereas we devise a new community structure detection algorithm based on it. We therefore summarize the concept briefly in this section and propose a new measure on dependence suitable for community structure detection in the next section. We assume that any two nodes in X can be connected via Markovian transitions, although the probability of a transition decays as the number of steps between the two nodes increases. Let X_t be a random walk that represents a node (or state) at the t-th transition. We define the statistical dependence between a node in the initial state (X_0) and another node at step t (X_t) as follows:

Definition 2.1. Dependence between x_m, x_i in X, denoted by Dep(X_0 = m, X_t = i), is

$$\mathrm{Dep}(X_0 = m,\, X_t = i) = \frac{\Pr(X_t = i,\, X_0 = m)}{\Pr(X_t = i)\,\Pr(X_0 = m)}. \qquad (2.1)$$

By definition, dependence is closely linked to the point-wise mutual information, which is widely used in information theory and statistics as a measure of association. Since the mutual information I(X_0; X_t) between two random variables X_0 and X_t is the expectation of the point-wise mutual information over all realizations of X_0 and X_t, we can express it in terms of dependence as follows:

$$I(X_0; X_t) = E[\log \mathrm{Dep}(X_0, X_t)] = \sum_{x_m, x_i \in X} \Pr(X_0 = m,\, X_t = i)\, \log \mathrm{Dep}(X_0 = m,\, X_t = i). \qquad (2.2)$$

Fig. 1. The concept of neighborhood transitions is illustrated. The edge represents one step transition. Although node 3 (in green) is closer to node 1 (in red) in Euclidean distance denoted by arrows than node 2 (in blue) is to node 1, node 2 is closer to node 1 than node 3 is to node 1 in terms of neighborhood-transition steps denoted by edges. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)



Dependence extends positive and negative quadrant dependence (PQD, NQD) [7,8]. It differs from PQD and NQD in that it provides local properties of dependence at each possible outcome of two random variables, while PQD and NQD are global properties of the two random variables. Intuitively speaking, dependence captures how x_m in the initial state is inter-dependent with x_i at the t-th step: Dep(X_0 = m, X_t = i) < 1 means that x_m and x_i are negatively dependent; Dep(X_0 = m, X_t = i) = 1 means that they are independent; and Dep(X_0 = m, X_t = i) > 1 means that they are positively dependent.

Dependence is also related to the lift measure in association rule learning. For a rule A -> B, the lift measure, lift(A -> B), is defined to be conf(A -> B)/P(B), where the confidence measure, conf(A -> B), is how sure consequence B is when antecedent A has occurred, namely P(B|A). That is to say, lift(A -> B) = P(A ∩ B)/(P(A)P(B)). A lift measure less than 1 means the two events are negatively associated, a lift measure greater than 1 means the two are positively associated, and a lift measure equal to 1 means random association. In essence, the lift measure captures the performance of the rule against random association and is widely used to find strong association rules [9,10]. Thus, Dep(X_0 = m, X_t = i) is the lift measure between the two events X_0 = m and X_t = i, revealing how strongly the two events are associated.
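As a small numerical illustration with hypothetical values (not taken from the paper): if P(A) = 0.4, P(B) = 0.5, and P(A ∩ B) = 0.3, then lift(A -> B) = 0.3/(0.4 × 0.5) = 1.5 > 1, so A and B are positively associated; if instead P(A ∩ B) = 0.2 = P(A)P(B), the lift equals 1 and the two events are independent.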

2.2. Group dependence

In this paper we propose a new measure called "group dependence" that quantifies the goodness-of-division of a graph. Group dependence, based on the dependence of Section 2.1, captures how much the data points in each group depend on each other. We start with the simple case of bisecting a graph and provide instructions on how to divide a graph into more than two groups in Section 2.4. Let s_i = 1 if data point i belongs to group 1 and s_i = -1 if it belongs to group 2. Point i is used interchangeably with node i. Observe that the quantity (1/2)(s_i s_j + 1) is 1 if i and j are in the same group and 0 otherwise. Denote the group assignment vector by s = [s_1, ..., s_n]. Group dependence is defined as follows:

Definition 2.2. Group dependence D_t for a given group assignment s and connectivity scale parameter t is

$$D_t = \frac{1}{2} \sum_{x_i, x_j \in X} \bigl(\mathrm{Dep}(X_0 = i,\, X_t = j) - d_0\bigr)\,(s_i s_j + 1). \qquad (2.3)$$

The term Dep(X_0 = i, X_t = j) - d_0 in (2.3) represents the degree of relative dependence, that is, the degree of dependence in comparison to the baseline dependence level d_0. A positive sign of the term for two states i and j indicates that the two points are positively dependent, and a large positive magnitude means that the dependence between the two nodes is strong enough to exceed the baseline dependence level. It reasonably signifies assigning an equal group identity (s_i = s_j) to the two points. The parameter d_0 is the baseline dependence level against which the relative dependence is computed. By default, d_0 is set to 1, the level of independence. For d_0 > 1, group dependence is the sum of differences between the dependence level of each pair of nodes and the positive predefined dependence level d_0, measuring the within-group excess dependence level for a group division. One could set d_0 < 1 to detect subdivisions with negative dependence. Accordingly, group dependence D_t measures the overall coherence, in terms of dependence, of a group assignment for the whole data set at the t-step transition.

The transition step t adjusts the level of connectivity scale on which two points are associated. For instance, group dependence D_t for t = 1 takes into account one-step neighbors of each node in building the whole coherence level, as is common to most clustering methods. For large t, group dependence D_t considers neighbors reached in more steps, and thus large-scale connectivity in the neighborhood produced by the evolution of Markovian transitions. When infinitely many transition steps are taken, the Markov chain converges to the stationary distribution and the group dependence becomes trivial because the dependence is constant: Dep(X_0 = i, X_t = j) = Pr(X_0 = i, X_t = j)/(Pr(X_0 = i)Pr(X_t = j)) = Pr(X_t = j | X_0 = i)/Pr(X_t = j) = 1. The observation that Dep(X_0 = i, X_t = j) = 1 for infinite t is intuitive in that the initial state and the long-run state should be unrelated. Thus, an optimal clustering scheme is achievable by maximizing the group dependence measure D_t over the group assignment s of all n points. In other words, we focus on the problem of arg max_s D_t for a given t when the norm of s is fixed. To express the level of closeness to a group, the group identity s_i is extended from discrete to continuous, which is also explained in Section 2.3.

To explain a simplified way to compute D_t, we need to describe the transition matrix of the data set. First, an n x n "neighborhood" or similarity matrix W is constructed from X. The similarity matrix W, symmetric by definition and also called the adjacency matrix, consists of nonnegative elements W_{i,j} that represent the closeness between point i and point j, which is usually reciprocal to the distance between i and j. A large value of W_{i,j} implies that the two points are substantially close to each other. Several popular adjacency measures exist in the literature, and we emphasize that the choice of similarity or distance measure should be application-driven (it is worth noting that the self-closeness W_{i,i}, in particular, can be specified according to prior knowledge on the self-connectivity of a node with its neighbors). A common choice for W_{i,j} is the Gaussian radial basis function

$$W_{i,j} = e^{-\|x_i - x_j\|^2 / \sigma^2} \qquad (2.4)$$

for a width parameter σ > 0 that regulates the level of contribution of ||x_i - x_j|| to the closeness measure. The choice of σ, usually comparable to typical distance values, depends on the belief about how fast the closeness between two points decays as distances become larger. For example, an infinite value of σ corresponds to a binary weight matrix W for two connected points. Second, a "transition" matrix P is constructed from W. Following from the similarity matrix W, we create the transition matrix P = D^{-1} W, where the diagonal matrix D is defined by D_{i,i} = Σ_k W_{i,k}, also called the degree of point i:

$$P_{i,j} = \frac{W_{i,j}}{D_{i,i}} = \frac{W_{i,j}}{\sum_k W_{i,k}}. \qquad (2.5)$$

The element P_{i,j} ≥ 0 is interpreted as the one-step transition probability from point i to point j. Then P_{i,j} equals Pr(X_1 = j | X_0 = i), and Σ_j P_{i,j} = 1 is easily verified. Unless information on self-transitions is given, the non-informative self-importance W_{i,i} = Σ_{k ≠ i} W_{i,k} / (n - 1) is set so that the transition probability to the self equals 1/n:

$$P_{i,i} = \frac{W_{i,i}}{D_{i,i}} = \frac{W_{i,i}}{\sum_{k \neq i} W_{i,k} + W_{i,i}} = \frac{\sum_{k \neq i} W_{i,k}/(n-1)}{\sum_{k \neq i} W_{i,k} + \sum_{k \neq i} W_{i,k}/(n-1)} = \frac{1}{n}. \qquad (2.6)$$

The transition matrix P^{(t)} at the t-th step becomes P^t, the t-th power of P: P^{(t)}_{i,j} = Pr(X_t = j | X_0 = i) = [P^t]_{i,j}. The computational hurdle in computing P^t is overcome by a one-time spectral decomposition of the symmetric matrix D^{-1/2} W D^{-1/2}.


After the spectral decomposition, the matrix D^{-1/2} W D^{-1/2} is expressed as U_W Λ_W U_W^T: the diagonal matrix Λ_W consists of the eigenvalues, and the columns of U_W are the associated eigenvectors. Then the transition matrix P becomes

$$P = D^{-1} W = D^{-1/2}\,(D^{-1/2} W D^{-1/2})\, D^{1/2} = D^{-1/2}\, U_W \Lambda_W U_W^\top\, D^{1/2}. \qquad (2.7)$$

This allows P^t to be computed efficiently via P^t = D^{-1/2} U_W Λ_W^t U_W^T D^{1/2}, since Λ_W is diagonal. Lastly, we define a diagonal matrix B^{(t)} by

$$B^{(t)}_{j,j} = \Pr(X_t = j) = \sum_{i=1}^{n} \Pr(X_t = j \mid X_0 = i)\,\Pr(X_0 = i) = [x_0^\top P^t]_j, \qquad (2.8)$$

where x_0 is the initial probability vector reflecting prior information on the initial states. Apparently, adjusting the prior probabilities of the initial states can bring forth Bayesian applications in the subsequent procedure. We observe that B^{(t)}_{j,j} is proportional to [1^T P^t]_j, the j-th column sum of P^t, if all initial states have equal probability as a non-informative prior, where 1 is the all-ones vector.
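To make the construction in (2.4)-(2.8) concrete, the following is a minimal sketch in Python/NumPy. It is our own illustration rather than the authors' implementation; the function names and the uniform prior x_0 = (1/n, ..., 1/n) are assumptions, and matrix_power stands in for the one-time spectral decomposition described above.

```python
import numpy as np

def similarity_matrix(X, sigma):
    """Gaussian RBF similarity (2.4): W[i, j] = exp(-||x_i - x_j||^2 / sigma^2)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / sigma ** 2)

def transition_matrices(W, t):
    """Build P = D^{-1} W with the non-informative self-transition of (2.6),
    then return P^t and the diagonal of B^(t) under a uniform initial distribution."""
    W = W.astype(float)
    n = W.shape[0]
    off_diag_sums = W.sum(axis=1) - np.diag(W)
    np.fill_diagonal(W, off_diag_sums / (n - 1))    # (2.6): makes P[i, i] = 1/n
    D = W.sum(axis=1)
    P = W / D[:, None]                               # (2.5): row-stochastic transition matrix
    Pt = np.linalg.matrix_power(P, t)                # (2.7): t-step transition probabilities
    x0 = np.full(n, 1.0 / n)                         # non-informative prior on initial states
    B_diag = x0 @ Pt                                 # (2.8): B^(t)_{j,j} = Pr(X_t = j)
    return Pt, B_diag
```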

Then the dependence Dep(X_0 = i, X_t = j) = Pr(X_t = j, X_0 = i)/(Pr(X_t = j)Pr(X_0 = i)) as in (2.1) becomes [P^t (B^{(t)})^{-1}]_{i,j}.

To understand the dependence Dep(X_0 = i, X_t = j) better, we consider the simple graph structure in Fig. 2(a). The application of the non-informative self-transition probabilities in (2.6) results in the following transition matrix P, B^{(1)}, and Dep(X_0, X_1) = P^1 (B^{(1)})^{-1} for t = 1:

P = [ 2/8  3/8  3/8  0
      3/8  2/8  3/8  0
      2/8  2/8  2/8  2/8
      0    0    6/8  2/8 ],   B^{(1)} = diag(7/32, 7/32, 14/32, 4/32),

Dep(X_0, X_1) = [ 8/7   12/7  6/7   0
                  12/7  8/7   6/7   0
                  8/7   8/7   4/7   2
                  0     0     12/7  2 ].

For instance, P_{1,2} was computed as P_{1,2} = W_{1,2}/D_{1,1} = 1/(2 + 2/3) = 3/8, and B^{(1)}_{1,1} = (1/4)(P_{1,1} + P_{2,1} + P_{3,1} + P_{4,1}), assuming the initial probability of each state is 1/4. We notice that Dep(X_0 = 2, X_1 = 4) = 0 because state 4 is unreachable from state 2 in one step. Similarly, for t = 2, we have the two-step transition matrix P^2, B^{(2)}, and Dep(X_0, X_2) = P^2 (B^{(2)})^{-1}:

P^2 = [ 19/64  18/64  21/64  6/64
        18/64  19/64  21/64  6/64
        14/64  14/64  28/64  8/64
        12/64  12/64  24/64  16/64 ],   B^{(2)} = diag(63/256, 63/256, 94/256, 36/256),

Dep(X_0, X_2) = [ 76/63  72/63  84/94   24/36
                  72/63  76/63  84/94   24/36
                  56/63  56/63  112/94  32/36
                  48/63  48/63  96/94   64/36 ].

Noticeably, Dep(X_0 = 2, X_2 = 4) = 24/36 is nonzero because state 4 is reachable from state 2 in two steps, which means that the neighborhood structure has evolved through Markovian transitions. In this sense, t is called a connectivity scale parameter, and the division of a given network that maximizes the group dependence measure obviously depends on the choice of t. In this paper, we empirically examine the performance of the proposed method for varying t in Section 3. Using the dependence measure Dep(X_0 = i, X_t = j) between two states, we express the maximization of the group dependence with respect to s, subject to ||s|| = 1, as follows:

$$\arg\max_{\|s\|=1} D_t = \arg\max_{\|s\|=1} \frac{1}{2}\sum_{i,j}\bigl(\mathrm{Dep}(X_0 = j,\, X_t = i) - 1\bigr)(s_i s_j + 1)
= \arg\max_{\|s\|=1} \Bigl[\sum_{i,j} \mathrm{Dep}(X_0 = j,\, X_t = i)\, s_i s_j - \sum_{i,j} s_i s_j\Bigr]
= \arg\max_{\|s\|=1} s^\top \bigl(P^t (B^{(t)})^{-1} - \mathbf{1}\mathbf{1}^\top\bigr)\, s = \arg\max_{\|s\|=1} s^\top G\, s, \qquad (2.9)$$

in which G = P^t (B^{(t)})^{-1} - 11^T. As a right stochastic matrix, P^t has all real eigenvalues, and B^{(t)}, being diagonal, also has real eigenvalues, which ultimately leads the eigenvalues of G to be all real. Thus, let us denote the eigenvalues of G by λ_i, i = 1, ..., n, in descending order, λ_1 ≥ ... ≥ λ_n, with associated eigenvectors s_i. The obtained eigenvectors are assumed to be normalized and orthogonal: applying the Gram-Schmidt process and s_i / ||s_i|| yields orthogonal and normalized vectors. We express the group assignment as a linear combination of the eigenvectors s_i: s = Σ_{i=1}^n a_i s_i. Then we rewrite the group dependence as

$$\sum_{i=1}^{n} a_i s_i^\top \, G \sum_{j=1}^{n} a_j s_j = \sum_{i=1}^{n} a_i s_i^\top \sum_{j=1}^{n} a_j \lambda_j s_j = \sum_{i=1}^{n} a_i^2 \lambda_i \qquad (2.10)$$

because the eigenvectors s_i are orthogonal. The goal here is to maximize Σ_{i=1}^n a_i^2 λ_i by finding an appropriate group assignment s, or equivalently by selecting the values of a_i. This means selecting a_i so as to put as much weight as possible on the term a_1 in (2.10) involving the largest eigenvalue λ_1. The existence of a largest positive eigenvalue and its eigenvector s_1 implies that the group dependence is maximally increased by adjusting a division of the network along the direction of the corresponding eigenvector s_1, and that the data points are divided based on the signs of s_1. By matching a division of the network to the signs of its elements, we gain a positive contribution to the overall group dependence, the objective function.
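As a sketch of the maximization in (2.9), the following Python function (our own illustration, continuing the earlier snippet) forms G = P^t (B^{(t)})^{-1} - 11^T, takes the eigenvector of the largest eigenvalue, and bisects the nodes by its signs; it reports that no division pays off when no positive eigenvalue exists.

```python
def bisect_by_dependence(W, t):
    """One bisection step of dependence clustering; a sketch of (2.9).

    Returns a boolean mask selecting one side of the split, or None when the
    largest eigenvalue of G is not positive (no further division yields gain)."""
    n = W.shape[0]
    Pt, B_diag = transition_matrices(W, t)
    G = Pt / B_diag[None, :] - np.ones((n, n))   # G = P^t (B^(t))^{-1} - 1 1^T
    # G need not be symmetric, so use a general eigensolver and keep the real
    # parts (the text argues that the relevant eigenvalues are real).
    eigvals, eigvecs = np.linalg.eig(G)
    k = int(np.argmax(eigvals.real))
    if eigvals.real[k] <= 1e-12:
        return None
    return eigvecs[:, k].real >= 0
```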

Fig. 2. Graphical representations of the illustrating examples for Sections 2.2 and 2.5. The graphs contain (a) 4 nodes and 4 edges and (b) 10 nodes and 13 edges, respectively.


In fact, s_1 is a one-dimensional representation of the data points, and the values of s_1 could be used for general purposes other than bisection, such as grouping, visualization, and classification, because the sign and magnitude of s_1 relate to the degree of closeness to one group against the other. We note that the eigenvector associated with eigenvalue zero represents assigning all data points to just one group. On the contrary, the nonexistence of a positive eigenvalue implies that any further division of the data yields no gain.

The presented approach has similarities to the graph Laplacian approach, one of the best-known methods of spectral clustering, which essentially minimizes a penalty proportional to the similarity W_{i,j} when data points i and j are assigned apart: Σ_{i,j} W_{i,j} (s_i - s_j)^2. One can then show that spectral clustering minimizes s^T (D - W) s while the proposed dependence clustering minimizes s^T (11^T - P^t (B^{(t)})^{-1}) s. One essential difference is that the graph Laplacian approach does not provide an inherent rule for the optimal number of clusters while the presented approach does. The presented approach is also similar to the modularity-based approach, which maximizes the strength of division of a network by Σ_{i,j} (W_{i,j} - D_{i,i} D_{j,j} / Σ_k D_{k,k}) s_i s_j. The presented approach, however, includes the connectivity of data points through Markovian transitions. Detailed comparisons with these approaches are provided in Section 3.

2.3. Optimality aspect of probabilistic connectedness

We present a theoretical justification of dependence clustering from the aspect of probabilistic connectedness. It relates to a desirable attribute for a good group division to possess, using prior and posterior probabilities of the states, and it turns out to be equivalent to the proposed method as in (2.9). First, let us assume for simplicity that the initial states are equally likely. Then, consider Pr(X_0 = j | X_t = i) - Pr(X_0 = j), the difference between the posterior probability of state j after reaching state i at the t-th step and the prior probability of state j. A large difference between Pr(X_0 = j | X_t = i) and Pr(X_0 = j) means a high likelihood of observing X_0 = j given X_t = i in comparison with that of observing X_0 = j at the beginning. We can interpret state j from the viewpoint of state i to be more likely than state j without any viewpoint. Thus, this quantity can serve as a measure of probabilistic connectedness between the two states: the higher the quantity, the more closely connected the two states. In light of this observation, a desirable clustering method will make its best effort to give two states with a large positive value of the quantity the same group identity:

$$\arg\max_{\|s\|=1} \sum_{i,j} \bigl(\Pr(X_0 = j \mid X_t = i) - \Pr(X_0 = j)\bigr)\, s_i s_j, \qquad (2.11)$$

where s_i represents the level of closeness or contribution of state i to a group, with a positive sign for one group and a negative sign for the other. The term Pr(X_0 = j) in (2.11) can account for possibly unequal importance of the initial states. Straightforwardly, the equation in (2.11) is equivalent to that in (2.9) when all initial states have equal importance:

$$\text{the equation in (2.11)} = \arg\max_{\|s\|=1} \sum_{i,j} \Bigl(\frac{\Pr(X_0 = j,\, X_t = i)}{\Pr(X_t = i)} - \Pr(X_0 = j)\Bigr) s_i s_j
= \arg\max_{\|s\|=1} \sum_{i,j} \Pr(X_0 = j)\bigl(\mathrm{Dep}(X_0 = j,\, X_t = i) - 1\bigr) s_i s_j
= \arg\max_{\|s\|=1} D_t \ \text{by (2.9)}.$$

Therefore, we assert that the proposed dependence clustering is justifiable in the sense of connectedness through prior and posterior probabilities.

2.4. Dividing into more than two groups

The procedure explained so far either divides a graph into two groups or decides not to divide further. It is natural to consider a network with more than two groups latent in its community structure. To obtain more than two clusters, we adopt the standard approach of subsequently dividing the groups already found [4]. Specifically, we look for a possible division of each group found in the previous step by constructing a new similarity matrix W^{(g)} for a found group g ⊆ X. W^{(g)}, of size |g| x |g|, is defined by W^{(g)}_{i,j} = W_{i,j} for all i, j in g. Following the same procedure of (2.5)-(2.8) based on W^{(g)}, the solution of arg max_{||s||=1} D_t^{(g)} in (2.9) enables us to decide whether a division is possible or not. It is important to note that a new group configuration with a new division does not always increase the group dependence of the whole graph, because the new similarity matrix W^{(g)} reflects only a part of the whole data set without considering connections to the nodes belonging to the other groups. Hence, among the possible divisions, we apply only the division that increases the group dependence in (2.3) for the whole data set the most. This step of examining the overall group dependence can be carried out easily because the computation of group dependence in (2.3) incurs almost no additional computational burden for a given group assignment s. Given the adjacency matrix W and connectivity scale t, we summarize the above steps in the following pseudo-code description:

[1] Initialize the number of groups, ℓ, with 1;
[2] Initialize the to-be-split group, g_1, with all data points;
[3] Initialize the group dependence, D_t as in (2.3), for g_1;
[4] While TRUE
[5]   Set D_t^next and K to zero;
[6]   For each g_k in the ℓ groups
[7]     Find a sub-division by solving (2.9) for the adjacency matrix W^(g_k);
[8]     Compute the new group dependence D_t^k with the sub-division applied;
[9]     If D_t^k > D_t^next
[10]      Update D_t^next with D_t^k and K with k;
[11]    End If
[12]  End For
[13]  If D_t^next > D_t
[14]    Apply the sub-division of g_K;
[15]    Update ℓ with ℓ + 1 and D_t with D_t^next;
[16]  Else
[17]    Finish the loop;
[18]  End If
[19] End While

Overall, the major computational bottleneck of the proposed method is computing the eigenvectors of the n x n matrix G. The eigenvalue computation is universal throughout most spectral clustering methods, and its time complexity is O(n^3) without special optimizations. However, finding the first eigenvector s_1 for a sparse graph can be done in O(n^2) by an adapted power method such as the Lanczos algorithm [11]. The repetitive sub-division process then continues until the network is reduced to its indivisible subgraphs. The average depth of the division is O(log n), which results in a complexity of O(n^2 log n) for the whole algorithm.
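The pseudo-code above translates directly into a compact implementation. The following sketch is our own (not the authors' code) and reuses the helper functions from the earlier snippets; the whole-graph group dependence of (2.3) is re-evaluated before a candidate split is accepted, mirroring steps [8]-[15].

```python
def group_dependence(W, t, labels, d0=1.0):
    """Group dependence (2.3) of a labeling of the whole graph.

    For same-group pairs the factor (1/2)(s_i s_j + 1) equals 1, so the sum
    runs over pairs that share a label."""
    Pt, B_diag = transition_matrices(W, t)
    Dep = Pt / B_diag[None, :]
    same = labels[:, None] == labels[None, :]
    return float(np.sum((Dep - d0)[same]))

def dependence_clustering(W, t):
    """Divisive dependence clustering: keep applying the best bisection
    while it increases the overall group dependence."""
    n = W.shape[0]
    labels = np.zeros(n, dtype=int)
    best = group_dependence(W, t, labels)
    while True:
        candidate = None
        for g in np.unique(labels):
            idx = np.where(labels == g)[0]
            if len(idx) < 2:
                continue
            mask = bisect_by_dependence(W[np.ix_(idx, idx)], t)
            if mask is None:
                continue                       # this group is indivisible
            trial = labels.copy()
            trial[idx[mask]] = labels.max() + 1
            score = group_dependence(W, t, trial)
            if candidate is None or score > candidate[0]:
                candidate = (score, trial)
        if candidate is None or candidate[0] <= best:
            break                              # no split increases D_t
        best, labels = candidate
    return labels
```

For example, labels = dependence_clustering(W, t=2) returns one integer community label per node of a small similarity matrix W.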


2.5. Illustrating example

To illustrate the whole procedure of dependence clustering, we demonstrate the method on the network shown in Fig. 3. This undirected graph contains 10 nodes and 13 binary-weight edges. Fig. 3(b) and (c) show the conversion from the graph to the corresponding similarity matrix W and transition matrix P. The probability of a transition to the self is set to be non-informative, i.e., the diagonals of the transition matrix are set to 0.1 according to (2.6). The construction of G by (2.8) for t = 1 and its eigenvalues are shown in Fig. 3(d) and (e). Obviously, the first two eigenvalues are positive as well as preeminent, which justifies the division procedure by the maximization of group dependence as in (2.9). Next, the scores of the 10 nodes on the first two eigenvectors are shown in Fig. 3(f), and the first division, by the signs of s_1, separates nodes 1, 2, and 3 from the other seven nodes. Similarly, the subsequent procedure separates nodes 8, 9, and 10 from the remaining nodes, as in Fig. 3(g). It is interesting to observe that the 2-dimensional representation in Fig. 3(f) closely reveals the network structure in Fig. 3(a): a cluster of nodes 8, 9, and 10 is already apparent in Fig. 3(f). Dependence clustering finds the three clusters correctly: {1, 2, 3}, {4, 5, 6, 7}, {8, 9, 10}. To see the effect of t in this example, we also show how dependence clustering works differently as we change t. For t ≤ 6, the three clusters {1, 2, 3}, {4, 5, 6, 7}, and {8, 9, 10} were obtained, except for t = 2, which gave four clusters {1, 2, 3}, {4, 5, 6, 7}, {8, 10}, and {9}. However, for t > 6, two clusters {1, 2, 3} and {4, 5, 6, 7, 8, 9, 10} were obtained. This result is consistent with the observation that a large t relates to a wide level of connectedness of the graph. We will see the detailed performance of the proposed method in the following section.

3. Experimental results

3.1. Simulation tests


A natural question arises at this point: how well does dependence clustering perform compared to other popular community structure detection algorithms? We set up an experiment to compare the performance of dependence clustering to that of popular clustering methods: agglomerative hierarchical clustering, spectral clustering, and the modularity clustering proposed by Newman [4]. We use randomly generated correlation matrices in which the underlying community structure is obvious. First, we generate a fixed number of multivariate normal random variables with zero means from the true correlation matrix specified in Fig. 4, in which four groups exist. The sample correlation matrix, computed from the generated random vectors, converges to the true correlation matrix as the number of observations increases. Conversely, the matrix becomes noisy when only a few observations are made. The right choice of the number of observations per replication is important because all methods under comparison will cluster correctly if the sample correlation matrix is too close to the true correlation matrix, and all methods will fail if the sample contains too much noise. We choose nine observations for each replicated sample correlation matrix, following Stone and Ayroles [12]. A simulated correlation matrix represents a graph of 12 nodes with edges whose weights are Pearson sample correlation coefficients r_{i,j} ranging from -1 to 1. Since all closeness measures between nodes must be nonnegative, we obtain the distance between two points i and j as the reciprocal of r_{i,j} shifted by 1: ||x_i - x_j|| = 1/(r_{i,j} + 1).
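A minimal sketch of this data-generating step follows (our own illustration, not the authors' code); the function name and seed handling are assumptions, while the nine observations and the distance and similarity transformations follow the text.

```python
import numpy as np

def simulated_similarity(true_corr, n_obs=9, sigma=0.3, seed=None):
    """Draw n_obs multivariate-normal observations from the true correlation
    matrix, form the sample correlation, and convert it to an RBF similarity."""
    rng = np.random.default_rng(seed)
    p = true_corr.shape[0]
    obs = rng.multivariate_normal(np.zeros(p), true_corr, size=n_obs)
    r = np.corrcoef(obs, rowvar=False)      # p x p sample correlation matrix
    dist = 1.0 / (r + 1.0)                  # distance ||x_i - x_j|| = 1 / (r_ij + 1)
    np.fill_diagonal(dist, 0.0)             # self-similarity is overridden by (2.6) anyway
    return np.exp(-dist ** 2 / sigma ** 2)  # Gaussian RBF similarity (2.4)
```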


Fig. 3. Illustration of the steps in dependence clustering. The first division separates nodes 1, 2, and 3 from the others as in (e), and the second division separates nodes 8, 9, and 10 from the remaining nodes as in (f).


       1     2     3     4     5     6     7     8     9     10    11    12
 1     1     0.9   0.2   0.2   0.2   0.2   0     0     0     0     0     0
 2     0.9   1     0.2   0.2   0.2   0.2   0     0     0     0     0     0
 3     0.2   0.2   1     0.7   0.7   0.7   0     0     0     0     0     0
 4     0.2   0.2   0.7   1     0.7   0.7   0     0     0     0     0     0
 5     0.2   0.2   0.7   0.7   1     0.7   0     0     0     0     0     0
 6     0.2   0.2   0.7   0.7   0.7   1     0     0     0     0     0     0
 7     0     0     0     0     0     0     1     0.6   0.6   -0.4  -0.4  -0.4
 8     0     0     0     0     0     0     0.6   1     0.6   -0.4  -0.4  -0.4
 9     0     0     0     0     0     0     0.6   0.6   1     -0.4  -0.4  -0.4
10     0     0     0     0     0     0     -0.4  -0.4  -0.4  1     0.8   0.8
11     0     0     0     0     0     0     -0.4  -0.4  -0.4  0.8   1     0.8
12     0     0     0     0     0     0     -0.4  -0.4  -0.4  0.8   0.8   1

Fig. 4. The true correlation matrix used to generate a fixed number of multivariate random variables aggregated to obtain a simulated correlation matrix. The clusters {1, 2}, {3, 4, 5, 6}, {7, 8, 9}, and {10, 11, 12} are evident. We adopt this matrix from Stone and Ayroles [12].

The step is reasonable in the sense that the higher r_{i,j} is, i.e., the more positively the two nodes are correlated, the smaller the distance becomes. We note that the shift amount, being arbitrary, was set to the minimum that makes the distance positive. The general form of the Gaussian radial basis function in (2.4) was used to obtain the similarity matrix W. To see the effect of the width parameter σ, we varied it from 0.15 to 0.3 and 0.5, considering that the variance of the distances is 0.11. While modularity clustering and dependence clustering do not need the number of clusters (k) a priori, the spectral and agglomerative hierarchical clustering algorithms do. In order to make a fair comparison, we misinform the latter algorithms with k = 3 and 5 as well as provide the true number of clusters, k = 4. For spectral clustering, we used the symmetric normalized graph Laplacian D^{-1/2} L D^{-1/2}, where D is the diagonal matrix of row sums of the input distance matrix and L = D - W. We extracted the k smallest eigenvectors of the Laplacian and clustered them using the k-means algorithm. To measure the quality of a clustering result, we used the Rand measure [13] as well as the ratio of correctly clustered pairs. To complement the Rand measure, we also used normalized mutual information. Both scoring methods assume that the true clustering is known. To compute the Rand measure, we iterate over all pairs of nodes and count how many pairs are correctly clustered or separated. Suppose we have n nodes in the graph. Let X = {x_i}, i = 1, ..., n, be the clustering result under evaluation and Y = {y_i}, i = 1, ..., n, be the true community structure, where x_i, y_i ∈ Z denote the integer identifiers of the clusters to which point i belongs. The Rand measure as well as the correctly clustered (CC) and correctly separated (CS) measures are defined as follows:

$$\mathrm{Rand} = \frac{\sum_{i,j} 1_{\{x_i = x_j\}} 1_{\{y_i = y_j\}} + \sum_{i,j} 1_{\{x_i \neq x_j\}} 1_{\{y_i \neq y_j\}}}{\sum_{i \neq j} 1}, \qquad (3.1)$$

$$\mathrm{CC} = \frac{\sum_{i,j} 1_{\{x_i = x_j\}} 1_{\{y_i = y_j\}}}{\sum_{i,j} 1_{\{x_i = x_j\}} 1_{\{y_i = y_j\}} + \sum_{i,j} 1_{\{x_i \neq x_j\}} 1_{\{y_i = y_j\}}}, \qquad (3.2)$$

$$\mathrm{CS} = \frac{\sum_{i,j} 1_{\{x_i \neq x_j\}} 1_{\{y_i \neq y_j\}}}{\sum_{i,j} 1_{\{x_i \neq x_j\}} 1_{\{y_i \neq y_j\}} + \sum_{i,j} 1_{\{x_i = x_j\}} 1_{\{y_i \neq y_j\}}}. \qquad (3.3)$$

The CC and CS measures provide additional insight into whether a clustering result groups nodes too much or too little, which the aggregate Rand measure does not capture explicitly. Among the many variants of normalization, we follow Strehl and Ghosh [14] and compute normalized mutual information as the mutual information between a clustering result and the true clustering, normalized by the square root of the product of the two entropies:

$$\mathrm{NMI}(X, Y) = \frac{I(X; Y)}{\sqrt{H(X)\,H(Y)}}, \qquad (3.4)$$

where I(X; Y) is the mutual information defined in (2.2) and H(X), H(Y) are the standard entropies for a finite sample: H(X) = -Σ_i Pr(X_i = x_i) log_2 Pr(X_i = x_i). Each clustering method is evaluated over 10,000 replications. Nine observations drawn from the true correlation matrix using a multivariate normal generator construct the sample correlation matrix for each replication; as a sufficient number of observations leads to good clustering accuracy, we chose nine observations for each clustering method, following Stone and Ayroles [12]. Table 1 shows the results of the simulation experiments. All clustering methods perform well in that the Rand measure is above 70% in all cases.
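The pair-counting and information-theoretic scores above can be computed directly from two label vectors. The sketch below is our own illustration of (3.1)-(3.4), not the authors' evaluation code; the function and variable names are assumptions.

```python
import numpy as np

def pair_scores(x, y):
    """Rand (3.1), correctly clustered CC (3.2), and correctly separated CS (3.3)
    for a clustering x evaluated against ground-truth labels y."""
    x, y = np.asarray(x), np.asarray(y)
    iu = np.triu_indices(len(x), k=1)              # each unordered pair once
    same_x = (x[:, None] == x[None, :])[iu]
    same_y = (y[:, None] == y[None, :])[iu]
    a = np.sum(same_x & same_y)                    # together in both clusterings
    b = np.sum(~same_x & ~same_y)                  # apart in both clusterings
    rand = (a + b) / len(same_x)
    cc = a / max(np.sum(same_y), 1)                # share of truly-together pairs kept together
    cs = b / max(np.sum(~same_y), 1)               # share of truly-apart pairs kept apart
    return rand, cc, cs

def nmi(x, y):
    """Normalized mutual information (3.4) with the sqrt(H(X) H(Y)) normalization."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    xs, ys = np.unique(x), np.unique(y)
    pxy = np.array([[np.sum((x == a) & (y == b)) for b in ys] for a in xs]) / n
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    mi = np.sum(pxy[nz] * np.log2(pxy[nz] / np.outer(px, py)[nz]))
    hx = -np.sum(px * np.log2(px))
    hy = -np.sum(py * np.log2(py))
    return mi / np.sqrt(hx * hy)
```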


Table 1
Simulation experiment results. The proposed method outperforms the other clustering methods in both the Rand measure and normalized mutual information when σ < 0.3. CC stands for the ratio of correctly clustered pairs, CS for the ratio of correctly separated pairs, and NMI for normalized mutual information as given by (3.4).

                   σ = 0.15                          σ = 0.3                           σ = 0.5
                   CC      CS      Rand    NMI       CC      CS      Rand    NMI       CC      CS      Rand    NMI
Dependence
  t = 1            0.8086  0.9763  0.9433  0.9112    0.8710  0.9466  0.9318  0.8988    0.8632  0.8902  0.8849  0.8401
  t = 2            0.8061  0.9644  0.9332  0.8959    0.8853  0.9113  0.9062  0.8712    0.8313  0.8538  0.8493  0.7990
  t = 3            0.8795  0.9503  0.9364  0.9073    0.9063  0.8864  0.8903  0.8597    0.8017  0.8495  0.8401  0.7972
  t = 4            0.8836  0.9435  0.9317  0.9019    0.9061  0.8664  0.8742  0.8428    0.7025  0.8492  0.8203  0.7687
  t = 5            0.8980  0.9352  0.9279  0.9007    0.9059  0.8529  0.8633  0.8344    0.6388  0.8677  0.8226  0.7750
  t = 6            0.9005  0.9300  0.9242  0.8965    0.8981  0.8387  0.8504  0.8207    0.5518  0.8759  0.8120  0.7576
  t = 7            0.9062  0.9252  0.9215  0.8953    0.8931  0.8317  0.8438  0.8166    0.4981  0.8932  0.8154  0.7634
  t = 8            0.9081  0.9207  0.9183  0.8918    0.8832  0.8227  0.8346  0.8064    0.4351  0.9007  0.8090  0.7527
  t = 9            0.9108  0.9175  0.9162  0.8908    0.8741  0.8193  0.8301  0.8037    0.3953  0.9145  0.8122  0.7578
  t = 10           0.9129  0.9134  0.9133  0.8878    0.8625  0.8126  0.8224  0.7949    0.3477  0.9200  0.8073  0.7500
Modularity         0.1910  0.9992  0.8400  0.7830    0.6878  0.9887  0.9295  0.8928    0.8550  0.9554  0.9356  0.9025
Hierarchical
  k = 3            0.9225  0.7603  0.7922  0.7793    0.9514  0.7900  0.8217  0.8105    0.9407  0.7916  0.8210  0.8031
  k = 4            0.8477  0.9133  0.9004  0.8664    0.8716  0.9172  0.9082  0.8783    0.8616  0.9129  0.9028  0.8697
  k = 5            0.7343  0.9558  0.9122  0.8699    0.7429  0.9588  0.9163  0.8744    0.7393  0.9565  0.9137  0.8705

Spectral
  k = 3            0.8962  0.7195  0.7543  0.7291    0.9137  0.7621  0.7919  0.7666    0.9122  0.7963  0.8192  0.7866
  k = 4            0.8149  0.8478  0.8413  0.7987    0.8373  0.8894  0.8792  0.8393    0.8372  0.9142  0.8991  0.8567
  k = 5            0.7145  0.9216  0.8808  0.8293    0.7211  0.9456  0.9014  0.8519    0.7013  0.9551  0.9051  0.8512

Both hierarchical and spectral clustering performed best in terms of normalized mutual information when they were correctly informed with k = 4. The Rand measure for hierarchical clustering shows slightly puzzling results: according to the Rand measure, hierarchical clustering performs best when it is informed of k = 5, not k = 4. Knowing that the number of clusters in the true correlation matrix is definitely four, this suggests that normalized mutual information is more accurate for scoring and comparing the quality of clustering results. Among the misinformed cases, k = 3 and 5, we observe that the hierarchical and spectral clusterings suffer more when they err on the lower side, i.e., when they are informed of k = 3. Regardless, misinformed or not, the hierarchical and spectral clusterings show stable performance over the entire range of σ ∈ [0.15, 0.5]. On the other hand, the performance of both dependence and modularity clustering has a strong trend over σ, in opposite directions: as σ decreases, the performance of dependence clustering rises and that of modularity clustering falls. The performance of our method, dependence clustering, is comparable to modularity clustering overall and outperforms it in the lower range of σ ∈ [0.15, 0.3]. A larger σ reduces the contrast in edge weights and thus puts our method at a disadvantage compared to modularity clustering. In the lower range of σ, the performance of dependence clustering dominates that of the other methods.

3.2. Karate club data

To validate dependence clustering and to demonstrate the effect of varying the connectivity scale parameter t, we apply the method to the classic social network data from the karate club faction analysis by Zachary [6]. Each person in this karate club is associated with one of two factions. To compare the clustering result with the factual division of the network, we vary the connectivity scale t over 1, 2, 4, and 10. Simply speaking, a high value of the connectivity scale t causes the division of the group to stop quickly, because the algorithm considers group dependence in a more global context, like modularity clustering, while a small t makes the division result more sensitive to the local geometry of the network.

Fig. 5(a)-(d) shows the progressive change of division results as t increases. The shape of the nodes, square or circle, denotes the factual division of the network found by Zachary [6], while the color of the nodes represents the clustering results produced by our method. Although we did not supply a preset number of clusters, our algorithm suggests different numbers of communities latent in the graph depending on the connectivity scale parameter t. Seven communities are identified when t = 1, as Fig. 5(a) shows, while dependence clustering performs only one division with t = 10. The cut found by the algorithm for t = 10 exactly corresponds to the fission of the club observed by Zachary [6], as Fig. 5(d) shows.
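As a usage illustration only (ours, assuming the sketches from Section 2), Zachary's karate club graph ships with the networkx package, so the experiment in this subsection can be approximated as follows; everything except the values of t is our assumption.

```python
import networkx as nx
import numpy as np

# Binary adjacency of Zachary's karate club (34 nodes); the recorded relative
# strengths could be used instead by exporting the "weight" edge attribute.
G = nx.karate_club_graph()
W = nx.to_numpy_array(G)

for t in (1, 2, 4, 10):
    labels = dependence_clustering(W, t=t)   # sketch from Section 2.4
    print(t, len(np.unique(labels)), "communities")
```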

3.3. Analysis of the Groupon tag co-occurrence network

Groupon is an online daily-deals promotion site that connects local merchants with local customers. Groupon staff handpick one to four tags for each daily deal they promote. Since each deal is unique in its offering and deal terms, tags serve to describe the general characteristics of each deal. Examples of tags include pizza, yoga, museum, etc. As the same tag may appear in multiple deals, we can construct a tag co-occurrence matrix, which becomes the graph data to be analyzed using different clustering methods including dependence clustering. To construct the data set, we collected the daily-deals data from Groupon's Application Programming Interface from October 2008 to November 2011. Our daily-deals data set contains 127,794 deals and 574 unique tags from which we computed the tag co-occurrence matrix. We further kept only the tags that appeared in more than 100 deals to ensure that the matrix is invertible. As a result, our final data set includes 201 unique tags. The tag co-occurrence matrix directly translates into the similarity matrix W for dependence clustering because a higher number of co-occurrences suggests a stronger relationship between two tags.

We first visualized the tag clusters identified by dependence clustering. Figs. 6-8 show the clustering results for t = 1, 4, and 10 (we used Gephi [15] to produce the graphics).


Fig. 5. (a)-(d) are dependence clustering results with t = 1, 2, 4, and 10, respectively. When t = 10, dependence clustering successfully identifies the two factions in accordance with the observation of Zachary [6]. We use two shapes (square and circle) to denote the two factual factions. Colors represent communities identified by dependence clustering. The resolution of division becomes progressively coarse as t increases. Edge width reflects the relative strength between nodes, ranging from 1 to 7, as recorded by Zachary [6]. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

The color of a node denotes the cluster the node belongs to; nodes of the same color are clustered into the same group by dependence clustering. The size of a node reflects how many daily deals the tag appeared in. A few tags that appeared in many deals, such as Arts and Entertainment, Beauty and Spas, Food and Drink, Health and Fitness, Restaurants, and Shopping, are prominent, while a variety of exotic tags also exist in the data set. We used the OpenORD layout [16] with the default edge-cutting parameter of 0.8. As OpenORD expects an undirected weighted graph, it fits our purpose of visualizing the tag co-occurrence matrix with integer weights. This layout aims to place tags that appeared together many times close to each other. Thus, we expect a good clustering algorithm to produce a graphic in which colors change gradually over regions. From this qualitative evaluation perspective, Fig. 8, showing a smooth color-change pattern over regions, exhibits a clear contrast to Fig. 6, where noisy group assignments are noted. This contrast concurs with the prediction from our theory: a large t value produces a coarser structure while a small t identifies many local clusters. Note that the connectivity scale parameter t differs from the usual k parameter specifying the number of clusters that some methods, such as hierarchical and spectral clustering using the k-means algorithm, require. The connectivity scale parameter t does not specify the output

number of clusters. Rather, it sets the scope of the clustering analysis. The ability to vary the scope can be useful because the true number of underlying communities is unknown a priori in most real-world community detection problems. This feature also distinguishes our method from modularity clustering, because the latter provides only a single fixed clustering answer for a given network structure. Thus, dependence clustering can provide flexibility to researchers, especially for exploratory network analysis. We next turn to a quantitative performance comparison among the different clustering methods. Fortunately, Groupon provides a taxonomy that assigns each tag to one of 18 broader categories. This taxonomy serves as a true community structure that allows us to compare the different methods. As we reduced the number of tags to 201, the number of broader categories that have more than one tag was also reduced to 13. As in the earlier simulation tests, we not only give the correct number of clusters, k = 13, but also intentionally misinform the spectral and hierarchical clustering methods with k = 12 and 14. All clustering methods except spectral clustering involve no randomness in the clustering process; the k-means module within the spectral algorithm needs a starting centroid, which is randomly determined. Thus, we apply spectral clustering to the same tag co-occurrence matrix 100 times and report the mean of every performance measure.


Fig. 6. Dependence clustering result on the Groupon tag co-occurrence network for t = 1.

Fig. 7. Dependence clustering result on the Groupon tag co-occurrence network for t = 4.

Table 2 shows the comparison results. Admittedly, modularity clustering gives highly accurate clustering results, with normalized mutual information over 0.95. Striking is the critical failure of hierarchical clustering, which scores 0.15 in normalized mutual information. Considering that hierarchical clustering performs well on the simulated set of correlation matrices, this evidence suggests that blind use of hierarchical clustering for the task of community structure detection in real-world applications may produce misleading results. Dependence clustering with t = 4 scores 0.88 in normalized mutual information, outperforming spectral clustering for all of k = 12, 13, and 14. Though worse than modularity clustering, dependence clustering gives reasonably stable and accurate identification of the underlying clusters for all values of t.


Fig. 8. Dependence clustering result on the Groupon tag co-occurrence network for t = 10.

Table 2
Groupon tag co-occurrence network clustering results. Although modularity clustering works extremely accurately on this data set, dependence clustering shows a comparable level of performance. Hierarchical clustering severely fails to reveal the correct underlying community structure. Spectral clustering shows reasonable performance; its performance is averaged over 100 runs because the k-means module in the spectral algorithm requires a random seed. CC and CS stand for the ratio of correctly clustered pairs and the ratio of correctly separated pairs, respectively. The last column shows the number of inferred clusters for each algorithm and setting.

                   CC      CS      Rand    NMI     Number of clusters
Dependence
  t = 1            0.5612  0.9555  0.9108  0.7251  16
  t = 2            0.9237  0.9733  0.9677  0.8909  11
  t = 3            0.8403  0.8878  0.8824  0.7547  9
  t = 4            0.9241  0.9675  0.9625  0.8821  11
  t = 5            0.9535  0.8983  0.9046  0.8133  8
  t = 6            0.9315  0.9677  0.9636  0.8868  11
  t = 7            0.9552  0.9034  0.9093  0.8214  8
  t = 8            0.9469  0.9441  0.9444  0.8535  9
  t = 9            0.9732  0.8547  0.8682  0.7747  6
  t = 10           0.9623  0.9053  0.9118  0.8234  8
Modularity         0.9588  0.9928  0.9889  0.9575  13
Hierarchical
  k = 12           0.8749  0.1149  0.2011  0.1340  12
  k = 13           0.8701  0.1249  0.2094  0.1432  13
  k = 14           0.8657  0.1348  0.2177  0.1524  14
Spectral
  k = 12           0.6890  0.8265  0.8109  0.6674  12
  k = 13           0.7084  0.8696  0.8514  0.7213  13
  k = 14           0.6870  0.8796  0.8578  0.7256  14

The accuracy of dependence clustering peaks at t = 6 and declines towards t = 1 or 10. This accuracy is with respect to the taxonomy defined by Groupon. We look into the clustering results to see whether the results for t = 1 and t = 10 provide additional qualitative insights that the aggregate performance measure does not convey. Fig. 9 shows such an in-depth look for the two most frequently occurring tag categories, "Arts and Entertainment" and "Restaurants". Indeed, t = 1 generates more granular clusters, which is "incorrect" according to Groupon's taxonomy. However, looking at the individual granular clusters identified at t = 1 reveals additional insight into the underlying community structure. For example, we know from the t = 1 results that the pair "Bike Tours" and "Segway Tours" appear together frequently to describe a deal, and so do the pair "Boating" and "Fishing" and the pair "Dinner Theater" and "Theater & Plays". Another interesting insight on the "Restaurants" side is that "Fine Dining" is more associated with the "French", "Greek", and "Mediterranean" cuisines serving "Seafood" and "Steakhouses" foods than with the "Chinese", "Indian", "Japanese", and "Thai" cuisines serving "Barbeque", "Hot Dogs", "Pizza", and "Sandwich/Deli" foods. This in-depth look and its interpretation support that t is not merely a tuning parameter; rather, different t values, whether lower or higher, provide useful clustering results of their own.

3.4. Analysis of ground-truth networks

We further extend our experiments to new data sets. An ideal data set would be an undirected graph with a true community structure so that we can quantitatively evaluate the performance of the clustering methods. Fortunately, Yang and Leskovec [17] constructed large social and information network data sets along with community structures of the nodes. We applied dependence clustering along with the other three methods to three undirected network data sets, each of which provides a subtly different context.



Fig. 9. Dependence clustering results for the two most frequently occurring tags (‘‘Arts and Entertainment’’ and ‘‘Restaurants’’) are shown. Although the result of t ¼ 1 separates tags incorrectly according to Groupon-defined taxonomy, the identified clusters reveal additional insight on the co-occurrence structure of the tags.

The first one is Amazon's product network based on users' co-purchasing behavior. Amazon has a feature that lists related merchandise under the tab named "Customers Who Bought This Item Also Bought." This feature allowed Yang and Leskovec [17] to construct a product co-purchasing network, in which Amazon-defined product categories serve as the ground-truth communities. The second data set is from DBLP, a computer science bibliography website, and contains the collaboration network of authors in computer science. Authors are connected if they collaborate on a publication, and the publication venue defines the ground-truth communities of the authors. The last one comes from the YouTube community network data. YouTube users form friendships and also voluntarily join groups. The friendships constitute the undirected social graph, and the group memberships provide the ground-truth communities of the users. As described above, each of the three data sets contains an undirected graph with binary edge weights and node-community membership pairs, as desired. The only practical obstacle to applying the clustering algorithms to these data sets in their entirety is that the networks are too large, so we devised a work-around. We randomly pick 10 mutually exclusive communities from the top-5000 high-quality communities list that Yang and Leskovec [17] provide. To alleviate the computational burden, we reject a sample containing more than 300 nodes. Because these communities are mutually exclusive by construction, the true community membership of each node allows us to quantitatively compare performance among the clustering methods. We extract 100 such random samples for the experiment.

Table 3 displays the summary network and community statistics of the samples and compares them with those of the populations; the sample statistics are averaged over the 100 samples. The three networks differ in multiple aspects. First, the average clustering coefficient of the population is highest in DBLP (0.7321) and lowest in YouTube (0.1723), and the samples show the same trend. Considering that the edge-to-node ratio is roughly the same across the three data sets (2.76 for Amazon, 3.31 for DBLP, and 2.63 for YouTube), the markedly different clustering coefficients suggest different edge structures. The average clustering coefficient of the sample is higher than that of the population in all three cases, which seems related to our sampling decision to select only nodes that are members of a group.
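As a concrete illustration of this rejection-sampling step, the sketch below draws 10 communities, discards draws that overlap or exceed 300 nodes, and returns the induced subgraph. The use of networkx and the names of the function and variables are illustrative assumptions, not the authors' actual code.

import random
import networkx as nx

def sample_disjoint_communities(G, communities, k=10, max_nodes=300, seed=None):
    """Draw k ground-truth communities and return the induced subgraph together
    with the chosen communities, or None if the draw is rejected (caller retries)."""
    rng = random.Random(seed)
    chosen = rng.sample(communities, k)             # communities: list of node sets
    nodes = set().union(*chosen)
    if sum(len(c) for c in chosen) != len(nodes):   # reject overlapping communities
        return None
    if len(nodes) > max_nodes:                      # reject overly large samples
        return None
    return G.subgraph(nodes).copy(), chosen

# Rejection loop (sketch): keep drawing until 100 valid samples are collected.
# samples = []
# while len(samples) < 100:
#     drawn = sample_disjoint_communities(G, top5000)
#     if drawn is not None:
#         samples.append(drawn)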

Table 3
Descriptive statistics on the random samples from the ground-truth networks data.

                                  Amazon                  DBLP                      YouTube
                                  Sample     Population   Sample     Population     Sample     Population
Number of nodes                   131.55     334,863      106.66     317,080        103.44     1,134,890
Number of edges                   387.45     925,872      313.67     1,049,866      170.71     2,987,624
Average clustering coefficient    0.6886     0.4297       0.8768     0.7321         0.3003     0.1723
Number of triangles               1297.23    667,129      2138.16    2,224,385      296.40     3,056,386
Number of communities             10         151,037      10         13,477         10         8385
Average community size            13.16      19.38        10.67      53.41          10.34      13.50
Average membership size           1.00       8.74         1.00       1.69           1.00       0.10
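The per-sample rows of Table 3 are standard graph statistics and can be recomputed for any sampled subgraph; a minimal sketch, assuming networkx and a hypothetical helper name:

import networkx as nx

def sample_statistics(G):
    """Descriptive statistics for one sampled subgraph (cf. the sample columns of Table 3)."""
    tri = sum(nx.triangles(G).values()) // 3   # each triangle is counted once at each of its 3 nodes
    return {
        "nodes": G.number_of_nodes(),
        "edges": G.number_of_edges(),
        "avg_clustering": nx.average_clustering(G),
        "triangles": tri,
    }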


Table 4
Clustering results on three ground-truth networks: Amazon, DBLP, and YouTube. For each data set, dependence clustering outperforms modularity clustering for t > 2. Hierarchical clustering shows different performance across the data sets, while spectral clustering shows stable and reliable performance.

Amazon
                        CC       CS       Rand     NMI
Dependence   t = 1      0.6648   1.0000   0.9262   0.9185
             t = 2      0.7444   0.9997   0.9403   0.9379
             t = 3      0.7935   1.0000   0.9497   0.9507
             t = 4      0.8318   1.0000   0.9566   0.9598
             t = 5      0.8607   1.0000   0.9633   0.9658
             t = 6      0.8660   1.0000   0.9642   0.9669
             t = 7      0.8682   1.0000   0.9648   0.9676
             t = 8      0.8782   1.0000   0.9672   0.9701
             t = 9      0.8812   1.0000   0.9677   0.9707
             t = 10     0.8793   1.0000   0.9673   0.9699
Modularity              0.7916   1.0000   0.9504   0.9501
Hierarchical k = 9      0.7843   0.3300   0.4209   0.3085
             k = 10     0.7600   0.3553   0.4376   0.3239
             k = 11     0.7306   0.3881   0.4598   0.3444
Spectral     k = 9      0.9286   0.8889   0.8335   0.8311
             k = 10     0.9491   0.8964   0.8485   0.8154
             k = 11     0.9483   0.8842   0.8251   0.7171

DBLP
                        CC       CS       Rand     NMI
Dependence   t = 1      0.7052   0.9999   0.9428   0.9300
             t = 2      0.8040   0.9999   0.9550   0.9534
             t = 3      0.8566   0.9999   0.9619   0.9641
             t = 4      0.8755   0.9999   0.9647   0.9692
             t = 5      0.8909   1.0000   0.9671   0.9728
             t = 6      0.8961   1.0000   0.9680   0.9742
             t = 7      0.9012   1.0000   0.9685   0.9747
             t = 8      0.9105   1.0000   0.9702   0.9766
             t = 9      0.9146   0.9998   0.9712   0.9773
             t = 10     0.9129   0.9997   0.9703   0.9768
Modularity              0.8455   0.9999   0.9610   0.9624
Hierarchical k = 9      0.7966   0.6917   0.7157   0.6687
             k = 10     0.7728   0.7227   0.7396   0.6847
             k = 11     0.7457   0.7537   0.7627   0.6996
Spectral     k = 9      0.9065   0.8834   0.8449   0.7043
             k = 10     0.9232   0.8938   0.8598   0.6827
             k = 11     0.9388   0.8957   0.8496   0.6000

YouTube
                        CC       CS       Rand     NMI
Dependence   t = 1      0.4539   0.9886   0.8408   0.8192
             t = 2      0.7351   0.9875   0.9099   0.8926
             t = 3      0.8502   0.9920   0.9409   0.9347
             t = 4      0.8368   0.9885   0.9353   0.9242
             t = 5      0.9009   0.9901   0.9564   0.9470
             t = 6      0.8757   0.9823   0.9419   0.9304
             t = 7      0.9184   0.9840   0.9585   0.9473
             t = 8      0.9244   0.9856   0.9608   0.9510
             t = 9      0.9478   0.9844   0.9708   0.9583
             t = 10     0.9423   0.9820   0.9654   0.9548
Modularity              0.7238   0.9937   0.9017   0.9019
Hierarchical k = 9      0.7609   0.2146   0.3662   0.1839
             k = 10     0.7440   0.2396   0.3822   0.2006
             k = 11     0.7269   0.2666   0.4001   0.2182
Spectral     k = 9      0.9345   0.8632   0.8077   0.7521
             k = 10     0.9722   0.8861   0.8468   0.7105
             k = 11     0.9749   0.8731   0.8280   0.6280


Fig. 10. Difference in normalized mutual information between dependence clustering and modularity clustering, plotted as a function of the connectivity scale parameter t for Amazon, DBLP, and YouTube. Dependence clustering outperforms modularity clustering for t > 2. The performance of dependence clustering on the YouTube data set fluctuates in the range t ∈ {3, 4, 5, 6}.

Assuming that nodes within a group have a denser edge structure, it is reasonable that the average clustering coefficients of the samples are higher than those of the populations. Second, the average community size is roughly three to four times larger in DBLP (53.41) than in Amazon (19.38) and YouTube (13.50). Note that the notion of community differs across the data sets: a community in Amazon is a product category, a community in DBLP is a publication venue, and a community in YouTube is a user-defined group. Within this context, each average community size makes sense; for example, it is reasonable that a publication venue has 53 distinct authors on average. This difference in average community size, however, disappears in the samples, which we attribute to our decision to reject samples containing more than 300 nodes in order to alleviate the computational burden. Lastly, the average membership size differs by an order of magnitude across the three data sets.


On average, an Amazon product is assigned to 8.74 different product categories, an author in DBLP publishes in 1.69 venues, and only one out of ten YouTube users joins an interest group. In the samples, of course, the average membership size is exactly 1, as we collect only mutually exclusive communities on purpose. Table 4 shows the performance comparison among the four clustering methods. The performance of dependence clustering generally increases in t, peaking at t = 9 and then declining slightly at t = 10 in all three cases. This implies that a higher t does not necessarily lead to more accurate clustering results; the choice of t ought to be context- and application-driven. Modularity clustering, as in the experiment on the Groupon data, produces highly accurate clustering results, above 0.90 in normalized mutual information in all three cases. Considering that modularity clustering is an established method in community structure detection and scores well in all cases, this result also indicates that our sampling method per se does not distort the node-group membership structure. The performance of hierarchical clustering appears to depend strongly on the clustering coefficient of the graph: it performs relatively well on DBLP, a graph with a high clustering coefficient, but fails on YouTube, a graph with a low clustering coefficient. Spectral clustering shows stable and reliable accuracy, roughly between 0.6 and 0.85 in normalized mutual information. In comparison, dependence clustering outperforms spectral clustering for all t, not to mention hierarchical clustering. More importantly, it outperforms modularity clustering beyond t = 2. Since modularity clustering is a cutting-edge algorithm in community structure detection, we highlight the comparison between the two in Fig. 10. Since the performance of modularity clustering is constant over t, all fluctuation patterns in t are attributable to dependence clustering. Especially on the YouTube data set, the performance of dependence clustering fluctuates in the range t ∈ {3, 4, 5, 6}. We suspect this fluctuation is due to the periodicity of the Markovian transitions in the underlying graph structure; it does not alter the conclusion of the overall performance comparison between dependence clustering and modularity clustering.
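For reference, the normalized mutual information and Rand index values reported here can be computed with standard library routines. The sketch below assumes scikit-learn; the geometric normalization corresponds to the definition used by Strehl and Ghosh [14], and the label vectors are hypothetical toy data.

from sklearn.metrics import normalized_mutual_info_score, rand_score

# Ground-truth community ids versus cluster ids returned by a method (toy labels).
true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [1, 1, 0, 0, 2, 2]   # label names may differ; both scores are permutation-invariant

nmi = normalized_mutual_info_score(true_labels, pred_labels, average_method="geometric")
ri = rand_score(true_labels, pred_labels)
print(nmi, ri)                     # both equal 1.0 only for a perfect partition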

4. Discussion and conclusions

Community detection algorithms, developed more recently than graph partitioning, have opened up a new horizon in network cluster analysis. Since Newman's pioneering modularity-based method [4], community detection algorithms have been employed by researchers in many fields and have generated insights beyond standard network analysis. Our study adds to this literature another community detection algorithm with a new ability to adjust the scope of the analysis. Dependence clustering, proposed in this study, shows performance comparable to modularity clustering even under the basic setting of t = 1 in the simulation tests. Although we confirm that modularity clustering is accurate, we show that dependence clustering outperforms the other three compared methods (modularity, hierarchical, and spectral clustering) as the connectivity scale parameter t is varied according to the context of the real-world application. We also showed some similarities between the proposed method and spectral and modularity clustering in the form of the objective function being optimized; a deeper connection between the proposed method and those methods remains a future research topic. The sensitivity of the dependence clustering result to the connectivity scale parameter t deserves another thread of discussion. In the experiment on the YouTube data set, we observed that the performance in normalized mutual information fluctuates over t depending on whether t is odd or even. Though the level of fluctuation is not large enough to undermine our conclusion on the performance of dependence clustering, the pattern is intriguing.


Considering that this fluctuation pattern does not appear in the other Amazon and DBLP data sets, we surmise that the fluctuation is due to an idiosyncratic periodicity of the YouTube data set. How to choose the connectivity scale parameter t is thus another future research topic. The advances in information technology and social networking services have enabled the accumulation of large-scale graph data. As such large-scale network data become more abundant, our evaluation method provides a guide to utilizing such large data sets in network analysis research. It also poses a future research question of how sampling network data differently could favor certain types of algorithms under evaluation and affect the comparison of different methods. Although modularity clustering has the merit of producing a definite clustering result without requiring any parameters a priori, methods taking input parameters are not necessarily inferior to those without. A more important question is what the nature of a parameter is and how the parameter serves the purpose of cluster analysis. The most common parameter required by traditional clustering methods, including hierarchical clustering and spectral clustering based on k-means, is the number of clusters to attain. Dependence clustering has a unique value proposition in that the user can set the level of detail as desired by changing the connectivity scale parameter t. Instead of giving a definite answer like modularity clustering, our method can serve as an exploratory tool that allows users to change the resolution of the analysis. Since the true number of underlying communities in a real-world application is rarely known beforehand, the ability to vary the level of detail can be as important as a definite answer on the number of clusters. In sum, we want to stress that each clustering method has different advantages. Hierarchical clustering is the most intuitive, as the user can conceptually visualize the dendrogram, a tree structure of similarity among nodes. Spectral clustering is useful for graph partitioning tasks with a predefined number of clusters. Modularity clustering can detect an unknown number of communities in a graph and divide the graph accordingly. Dependence clustering has the ability to vary the connectivity scale, or the scope of analysis, with which the user explores network data with unknown structure. This ability of dependence clustering is rather unique in the literature on network analysis and data mining.

Acknowledgements

This research was supported by a grant from the R&D Program (Industrial Strategic Technology Development) funded by the Ministry of Knowledge Economy (MKE), Republic of Korea (Grant No. 10042693, Socio-Cognitive Design Technology for Convergence Service). The authors are also deeply thankful to everyone involved at MKE and KEIT (Korea Evaluation Institute of Industrial Technology).

References

[1] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2006.
[2] G. Gan, C. Ma, J. Wu, Data Clustering, SIAM, Society for Industrial and Applied Mathematics, 2007.
[3] Y. Jung, H. Park, D. Du, B. Drake, A decision criterion for the optimal number of clusters in hierarchical clustering, J. Global Optim. 25 (1) (2003) 91–111.
[4] M. Newman, Modularity and community structure in networks, Proc. Natl. Acad. Sci. 103 (23) (2006) 8577–8582.
[5] K. Lee, A. Gray, H. Kim, Dependence maps, a dimensionality reduction with dependence distance for high-dimensional data, Data Min. Knowl. Discovery 26 (3) (2013) 512–532.
[6] W.W. Zachary, An information flow model for conflict and fission in small groups, J. Anthropol. Res. 33 (4) (1977) 452–473.
[7] D. Mari, S. Kotz, Correlation and Dependence, World Scientific Publishing Company, 2001.


[8] R. Nelsen, An Introduction to Copulas, Springer Verlag, 2006.
[9] G. Shmueli, N. Patel, P. Bruce, Data Mining for Business Intelligence: Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner, Wiley, 2010.
[10] K. Ahn, Effective product assignment based on association rule mining in retail, Expert Syst. Appl. 39 (2012) 12551–12556.
[11] G.H. Golub, C.F. Van Loan, Matrix Computations, vol. 2, Johns Hopkins University Press, 2012.
[12] E. Stone, J. Ayroles, Modulated modularity clustering as an exploratory tool for functional genomic inference, PLoS Genet. 5 (5) (2009) e1000479.
[13] W.M. Rand, Objective criteria for the evaluation of clustering methods, J. Am. Stat. Assoc. 66 (336) (1971) 846–850.

[14] A. Strehl, J. Ghosh, Cluster ensembles – a knowledge reuse framework for combining multiple partitions, J. Mach. Learn. Res. 3 (2003) 583–617.
[15] M. Bastian, S. Heymann, M. Jacomy, Gephi: an open source software for exploring and manipulating networks, in: International AAAI Conference on Weblogs and Social Media, 2009, pp. 361–362.
[16] S. Martin, W.M. Brown, R. Klavans, K.W. Boyack, OpenOrd: an open-source toolbox for large graph layout, in: SPIE Conference on Visualization and Data Analysis (VDA), 2011, pp. 786806–786806-11.
[17] J. Yang, J. Leskovec, Defining and evaluating network communities based on ground-truth, in: Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics, ACM, 2012, 3 p.