Pattern Recognition 88 (2019) 518–531
Diffusion network embedding

Yong Shi b,c,d,e, Minglong Lei a, Hong Yang f, Lingfeng Niu b,c,d,∗

a School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing, 100190, China
b School of Economics and Management, University of Chinese Academy of Sciences, Beijing, 100190, China
c Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing, 100190, China
d Research Center on Fictitious Economy & Data Science, Chinese Academy of Sciences, Beijing, 100190, China
e College of Information Science and Technology, University of Nebraska at Omaha, NE, 68182, USA
f Center for Artificial Intelligence, University of Technology Sydney, NSW, 2000, Australia
Article info

Article history: Received 11 July 2018; Revised 1 November 2018; Accepted 6 December 2018; Available online 11 December 2018

Keywords: Network embedding; Cascades; Diffusion process; Network inference; Dimension reduction
Abstract

In network embedding, random walks play a fundamental role in preserving network structures. However, random walk methods have two limitations. First, they are unstable when either the sampling frequency or the number of node sequences changes. Second, in highly biased networks, random walks are likely to be biased towards high-degree nodes and to neglect global structure information. To address these limitations, we present in this paper a diffusion network embedding method. For the first limitation, our method uses a diffusion driven process to capture both depth and breadth information in networks, and temporal information is included in the node sequences to strengthen information preservation. For the second limitation, our method uses a network inference technique based on information diffusion cascades to capture global network information. Experiments show that the proposed method is more robust to highly unbalanced networks and performs well when the number of samplings per node is small.

© 2018 Elsevier Ltd. All rights reserved.
1. Introduction

Network representation learning [1,2] is widely used in analyzing large networks such as social networks and academic citation networks. The principle is to project network nodes from the original topological feature space into a low-dimensional Euclidean space while minimizing information loss during the feature transformation [3,4]. To date, network embedding has shown its advantage in improving the performance of network classification [5,6], anomaly detection [7] and community detection [8,9]. Early network embedding methods often formulate the problem as dimension reduction and analyze the network adjacency matrix, which is deterministically defined before calculation. Therefore, they can only handle static network structures. Recently, graph sampling [10,11] has been widely used for embedding learning, where random walks are introduced to preserve network structure information. For example, node2vec [12] explores network structures by using breadth-first sampling (BFS) and depth-first sampling (DFS). Graph sampling generates node sequences which are then fed into skip-gram [13] to vectorize the network structure into a low-dimensional Euclidean space.
∗ Corresponding author.
E-mail addresses: [email protected] (Y. Shi), [email protected] (M. Lei), [email protected] (H. Yang), [email protected] (L. Niu).
https://doi.org/10.1016/j.patcog.2018.12.004
However, random walk methods have two limitations. First, stochastic sampling requires multiple attempts to fairly describe the neighbors of nodes. In random walks, the success of preserving network structure information highly depends on the repeated samplings imposed on each node. Therefore, these methods are unstable when either the sampling frequency or the total number of node sequences changes. Second, encoding local-structure node sequences into low-dimensional representations by skip-gram [13] is an end-to-end process. The skip-gram model uses an embedding lookup and directly decodes the locally explored information from random walks into the embeddings [14]. In highly biased networks, where the majority of nodes have either extremely high or extremely low degrees, local sampling methods are likely to be biased towards high-degree nodes [15] and often perform poorly due to the lack of global network information. To address the above two shortcomings, we present in this paper a diffusion based embedding model that can dynamically detect network structures [16–19]. Specifically, the diffusion embedding method can be considered as a two-step framework which consists of a diffusion sampling step and a diffusion embedding step. In the diffusion sampling step, our method simulates the information diffusion process and generates a collection of node sequences. Without tuning the parameters between BFS and DFS, the diffusion provides an intuitive way to detect both breadth and depth structure information. Since more nodes are involved in the
process of sampling, our method can better capture local information than traditional random walks given the same walk length and sampling frequency. Compared to traditional random walks, our method considers temporal information in networks and includes the temporal dimension in the sampling sequences. In the diffusion embedding step, based on the diffusion cascades, we dynamically infer the network structure [17] and obtain a network adjacency matrix. Instead of directly encoding the sampling sequences, network inference runs over the entire network. The resulting matrix can capture global and even latent structure information. Diffusion embedding is consequently more robust to unbalanced network structures. Then, we solve a matrix factorization problem to obtain the low-dimensional representations. The contributions of this paper are summarized as follows.
• We present a new network embedding method based on information diffusion in networks. Different from previous random walks, our method records all the visited nodes and transforms single-trace random walks into multiple-trace random walks.
• We present a new strategy to capture the structure information of networks. Given the same walk length and sampling frequency, our method can better capture local information than traditional random walk methods.
• We conduct experiments on both synthetic and real-world datasets with respect to node classification tasks. The results show that our method outperforms the baselines.

The remainder of the paper is organised as follows. Section 2 reviews popular network embedding methods and briefly covers some preliminary concepts. Section 3 introduces the details of the diffusion sampling procedure and the diffusion embedding procedure. Section 4 formulates the complete algorithm and the corresponding analysis. Section 5 presents the experiments on multi-class classification tasks and analyzes the results. We conclude our work in Section 6.
2. Related work

2.1. Network embedding

Network embedding is a subtopic of representation learning on networks. Early methods such as Laplacian Eigenmaps (LE) [20], Locally Linear Embedding (LLE) [21] and IsoMAP [22] serve as dimension reduction techniques that were not originally designed for networks. In these methods, graphs are constructed off-line by computing the distances of node attributes. The representations are obtained by solving for the eigenvectors of adjacency matrices or their variations. Basically, the way to improve the quality of embeddings is to introduce more information such as graph statistics [23] and high-order information [24]. On the other hand, techniques from other applications can also be introduced to discover graph patterns [25–29]. Besides dimension reduction [1], an intuitive solution is to use matrix factorization [24,30] to obtain low-dimensional representations. Similar to the previous dimension reduction methods, the graph proximity is revealed directly by an adjacency matrix or its variations. One basic assumption of these methods is that the connection between any two vertices is modeled as the dot product of their low-dimensional embeddings [14]. Generally speaking, matrix based methods are a trade-off between different orders of network structural information. Since they mainly depend on visible and deterministic connections of vertices, the latent structures in real-world networks cannot be properly exploited. Generally, embedding real-world networks demands models that preserve the actual network proximity in the derived features while reducing the dimension. Examples with the same motivation are the hash retrieval methods [31,32], in which the data similarity is kept in the hash codes so that the retrieval task can be easily launched. Note that networks are naturally formed, in contrast to off-line constructed graphs. Consequently, specifically designed methods are required for the network embedding task. Recently, graph sampling methods have been widely used in network embedding. Instead of measuring the proximity of graphs by deterministic edges, sampling methods capture the node proximity by using stochastic measures [14]. Nodes that appear in the same node sequence have similar representations. DeepWalk [33] and node2vec [12] adopt different sampling strategies to sample local sequences. The limitations of random walk sampling have been carefully discussed in works such as [34]. Those local node sequences are directly mapped into the latent representation space and hence fail to capture the global information. Another typical work, LINE [35], is a large-scale information embedding method which designs a loss function that captures both first-order and second-order proximity information. LINE is also treated as a stochastic embedding model because it optimizes a probabilistic loss function. However, LINE still suffers from the challenge of potentially losing the representation of global network information. There are several attempts to introduce global information into network embedding [24,34]. In [34], the similarity of local subgraphs serves as global information to enhance the embedding quality. Our diffusion method addresses the global information through cascades and network inference. We directly infer a global weight matrix which also describes the latent structures of the network. Additionally, our work is similar to cascade based network embedding methods such as COSINE [19]. COSINE is a dynamic embedding method based on cascades from the real world. Its main idea is to use a Gaussian mixture model to preserve the community information in the network. In contrast, our work simulates the diffusion process under static networks to grasp structure information.

2.2. Random walk and diffusion

We briefly introduce the basic ideas of random walks and diffusion processes. Random walks and diffusions were originally studied in physics to describe molecular movements. The original random walk refers to a discrete stochastic process [36]. For example, a typical random walk is defined over the integer line, with probability one half of moving right (Δx = +1) and probability one half of moving left (Δx = −1). Denote the location of the walker at time step t as X_t; then X_t = X_{t−1} + Δx. The diffusion, in contrast, is defined over a continuous space and continuous time by a stochastic differential equation dX_t = μ(X_t)dt + δ(X_t)dW_t, where W describes the Brownian motion, which is highly related to random walk models, δ is the diffusion coefficient and μ is the drift term [36]. Obviously, diffusions possess the randomness of random walks and are subject to more sophisticated stochastic rules [36]. In network analysis, random walks are the basic blocks of diffusion processes such as epidemic spreading [37] and opinion propagation [38].
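To make the contrast concrete, the following Python sketch simulates both processes: a discrete random walk with unit steps and an Euler-Maruyama discretization of dX_t = μ(X_t)dt + δ(X_t)dW_t. Function names, step size and the drift/diffusion choices are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def discrete_random_walk(steps, rng):
    """Discrete random walk: X_t = X_{t-1} + dx, dx = +1 or -1 with probability 1/2."""
    dx = rng.choice([-1, 1], size=steps)
    return np.concatenate(([0], np.cumsum(dx)))

def diffusion_path(steps, dt, mu, delta, rng):
    """Euler-Maruyama discretization of dX_t = mu(X_t) dt + delta(X_t) dW_t."""
    x = np.zeros(steps + 1)
    for t in range(steps):
        dW = rng.normal(0.0, np.sqrt(dt))   # Brownian increment over dt
        x[t + 1] = x[t] + mu(x[t]) * dt + delta(x[t]) * dW
    return x

rng = np.random.default_rng(0)
print(discrete_random_walk(10, rng))                                             # integer-valued path
print(diffusion_path(10, 0.1, mu=lambda x: 0.0, delta=lambda x: 1.0, rng=rng))   # pure Brownian motion
```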
3. The model

3.1. Framework and notations

Let us first introduce the basic framework of our model. The goal of network embedding is to learn an explicit or intrinsic function that maps nodes into vectors. The vectors are required to reveal the proximity in the node domain. Concretely, network embedding contains a proximity detecting phase and a modeling and mapping phase.
Fig. 1. The framework of diffusion based network embedding.
Table 1. Summary of notations (symbol: definition).
N: the number of vertices
K: the number of node sampling steps
T^c: the observation window size for cascades
v_i: the i-th vertex
t_i: the time that vertex v_i becomes active
U_{v_i} := (U^1_{v_i}, U^2_{v_i}, ..., U^K_{v_i}): the node sampling sequence that begins at vertex v_i
t^c := (t^c_1, ..., t^c_N): a cascade that records the infection times of all nodes
α_{i,j}: the transmission rate between vertex v_i and vertex v_j
The proximity detecting phase discovers and defines the patterns of the proximity. Then, the modeling and mapping phase encodes the proximity into a low-dimensional space. Deterministic network embedding and stochastic network embedding use different strategies in the two phases. While graphs in deterministic models are generally represented as matrices with fixed sizes, stochastic models are flexible and treat the graph as a repository from which a collection of node sequences can be sampled. Stochastic models are easy to parallelize, which largely relieves the high computational cost of deterministic models. In addition, the modeling and mapping phase in stochastic network embedding models is usually based on probability models. In this paper, we propose a stochastic network embedding model in which both the data and the model are stochastic. We illustrate in Fig. 1 the framework of our model. Our proximity detecting phase is a diffusion sampling procedure where the proximity and structural information is encoded into cascades by stochastic samplings. The cascades are obtained by simulating the diffusion process over the network. The diffusion process improves upon the random walk in its ability to capture proximity information. Nodes that co-occur in cascades are supposed to have similar representations. After formulating a collection of cascades, our modeling and mapping phase is a diffusion embedding procedure which maps the information in the cascades into the final representations. The transmission rates that accurately describe the connections of nodes are modeled by probability models. In the last step, we factorize the transmission rates into the representations of nodes. In the following subsections, we introduce our realization of the diffusion sampling and the diffusion embedding at length. Before elaborating on the model, we first summarize in Table 1 the notations and definitions of the symbols used in the following sections.
3.2. Diffusion sampling procedure

The success of stochastic models depends on the properties of the sampling strategy over graphs. The sampling process generates sequences that represent the proximity of nodes. We elaborate the improvements of diffusion based samplings over random walk samplings in the rest of this subsection. In this paper, the diffusion sampling procedure defines a series of stochastic samplings conducted over networks that encode network structure information into sequences. Given a network G(V, E) with a vertex set V: {v_1, ..., v_N} and an edge set E: {v_{i,j}}^N_{i,j=1}, the diffusion sampling procedure operates over the graph by node samplings and time samplings. The aim of the diffusion sampling procedure is to keep the neighborhood information and node position information in a collection of information cascades.
3.2.1. The diffusion process

Compared with random walk sampling, the diffusion process in networks generates more informative traces. In this part, we introduce the node sampling and explain how the diffusion process improves random walks in maintaining the network structure information. First, we briefly explain the diffusion process. Similar to molecular diffusion in a fluid, where particles move from high concentration areas to low concentration areas, a network can be regarded as a system that changes from unstable states to stable states throughout the diffusion process [39]. Initially, the network is an unbalanced system where only a few nodes are active. Since there are information gaps among nodes, the diffusion process starts when information is delivered from active nodes to inactive nodes. The system becomes stable when the information is evenly distributed.
Fig. 2. An illustrative example of random walk and diffusion process in a four-node directed graph. When the walking starts, the diffusion process has more active nodes than the random walk at each step. The diffusion process generates multi-trace node sequences and includes more information than random walk node sequences.

Different from single-trace random walks, the node sampling process of the diffusion in network G(V, E) generates node sets that describe the dynamics of the participating nodes. Concretely, we choose a random node v_i (v_i ∈ V) as the seed to start a diffusion process. Suppose the maximal walking step is K. Given an arbitrary step k ∈ K, U^k_{v_i} denotes a subset of V which includes all nodes that are active in the current step. In step k + 1, all nodes in U^k_{v_i} serve as seeds to start node samplings. We use the set Δ^k_{v_i} to denote the newly infected nodes generated from U^k_{v_i}. Note that the n_k nodes in U^k_{v_i} will sample n_k nodes, each of which is uniformly selected from the neighbors of a corresponding node in U^k_{v_i}. However, Δ^k_{v_i} only contains nodes that have not appeared in previous steps. Then U^k_{v_i} updates to U^{k+1}_{v_i} by adding the nodes newly infected in step k. Therefore, for each node v_i, we obtain Δ_{v_i} := (Δ^1_{v_i}, Δ^2_{v_i}, ..., Δ^K_{v_i}) and U_{v_i} := (U^1_{v_i}, U^2_{v_i}, ..., U^K_{v_i}), where K is the length of walking steps. We compare traditional random walks and the diffusion process in a small directed network with four nodes, as illustrated in Fig. 2. The walking step K is set to four for simplicity. In Fig. 2(a), we first launch a random walk starting at node v_1. In each walking step, the walker moves to one of its neighbors with uniform probability. The gray circle indicates the current node that the walker reaches. After a four-step walk, it generates a node sequence (v_1, v_2, v_3, v_2) which reflects local network structures. In Fig. 2(b), we simulate a diffusion process over a directed graph that also starts at node v_1. The difference between the above two random samplings is whether the diffusion process is recorded and whether an active node stays active after being visited. The diffusion process generates a sequence of node sets ((v_1), (v_1, v_2), (v_1, v_2, v_3), (v_1, v_2, v_3)) revealing the evolution of active nodes. As illustrated in Fig. 2, if we record a node sequence within K = 5 steps, the node v_4 in the random walk sequence will not be visited since the walker has already passed through all neighbors of v_4. However, since the diffusion process stores all nodes that have been visited previously, v_1, v_2, v_3 stay active at the same time, and the diffusion walker can still visit v_4 when k = 5. Obviously, each time we start a sampling during a diffusion process, different from random walks where only one node is selected as the seed, an activated node can sample multiple times and visit more neighbors. The walker is thus improved from a single-sampling trace to a multiple-sampling trace and can better exploit local structure information.
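As a concrete illustration of the sampling just described, the following Python sketch contrasts a single-trace random walk with the multi-trace diffusion node sampling, in which every active node draws one neighbor per step and visited nodes stay active. The toy adjacency list and function names are hypothetical; the graph only loosely mirrors Fig. 2.

```python
import random

def random_walk(adj, seed, K):
    """Single-trace random walk of K steps (the baseline behaviour)."""
    walk = [seed]
    for _ in range(K - 1):
        neighbors = adj.get(walk[-1], [])
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return walk

def diffusion_node_sampling(adj, seed, K):
    """Multi-trace diffusion sampling: every active node samples one neighbor
    per step, visited nodes stay active, and the active set U^k is recorded."""
    active = {seed}
    history = [set(active)]            # U^1, U^2, ..., U^K
    for _ in range(K - 1):
        newly = set()                  # Delta^k: nodes that were not active before
        for node in active:
            neighbors = adj.get(node, [])
            if neighbors:
                nxt = random.choice(neighbors)
                if nxt not in active:
                    newly.add(nxt)
        active |= newly
        history.append(set(active))
    return history

# a small directed toy graph (hypothetical; it only loosely mirrors Fig. 2)
adj = {"v1": ["v2"], "v2": ["v3"], "v3": ["v2", "v4"], "v4": ["v3"]}
print(random_walk(adj, "v1", 4))               # e.g. ['v1', 'v2', 'v3', 'v2']
print(diffusion_node_sampling(adj, "v1", 4))   # growing active sets
```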
3.2.2. The formulation of cascades

The simulation of diffusion over a network is discrete in steps and continuous in time. Besides the node sampling, we introduce in this part the time sampling that describes the continuous aspect of the diffusion process. The time information can reveal the latent and global structures of the network. Generally, the transmission of knowledge or disease does not happen instantly in the diffusion process [40]. Another important feature of a diffusion process for discovering structural information is the time stamp at which a node receives the propagated information. For example, in the epidemiology scenario [37], time stamps indicate the times when nodes get infected by diseases. We use information cascades [41] to record the infection time stamps of nodes during the diffusion. Models based on cascades have been popularly used in network applications such as recommendation systems [42,43]. To discover network structures, cascades record flows of information in networks [17]. Consider a graph with N vertices: if we fix an observation window [0, T^c], a cascade is defined as an N-dimensional vector t^c := (t^c_1, ..., t^c_N) where each element of the vector records the first infection time of the corresponding node. For any n ∈ [1, N], t^c_n ∈ [0, T^c] ∪ {∞}, where ∞ implies that the node is not observed within the observation window. Each node in the cascade is thus attached with a time stamp. A collection of |C| cascades can then be represented as {t^1, ..., t^{|C|}}. Note that each time a cascade is generated, the time stamp is reset to 0. For simplicity, the observation windows are set to be of equal length for c ∈ [1, |C|]. We use the symbol T^c to denote the observation window for all cascades. Recall that we can obtain a collection of node sets U_{v_i} for v_i ∈ V by using the node sampling process. In order to formulate cascades from the previous node sets, we introduce the time interval sampling strategy. The transmission time between two nodes can be depicted by proper transmission time models [18]. The most common models at present are the power law model, in which the time interval Δt is subject to d(Δt) = (ξ − 1)Δt^{−ξ}, and the exponential model, in which the time interval Δt is subject to d(Δt) = ξ e^{−ξΔt}. The time interval samplings are launched along with the node samplings. In detail, for each node that is active in the current step, we sample a neighbor of the current node as the next location of the diffusion walker, and then sample a time interval Δt from the time model as the transmission time of the diffusion. The time stamp of the initial node v_i is set to 0. As the walking proceeds, the nodes newly infected from U^k_{v_i} to U^{k+1}_{v_i}, namely Δ^k_{v_i}, will be assigned time stamps based on the sampled time intervals and the time stamps of their source nodes. Notice that an already activated node cannot be infected again even if the walker revisits it.
Assume that two nodes in U^k_{v_i} share the same neighbor and both walk to this shared neighbor in step k + 1; the time interval sampling will then be conducted twice, and the smaller time stamp value will be assigned to the neighbor node. In other words, we only record the time of its first infection. We use U^t_{v_i} as the set that records all active nodes with time stamps in a K-step diffusion process. U^t_{v_i} updates itself by adding newly infected nodes with their time stamps from step k to k + 1. Then, we can select the nodes in U^t_{v_i} that lie in the observation window T^c to formulate a cascade t^{v_i} := (t^{v_i}_1, ..., t^{v_i}_N). The cascade collection {t^1, ..., t^{|C|}} is then obtained by conducting samplings from different nodes. Also recall that the update of U_{v_i} is actually the evolution of the network: the joining of new nodes and edges at each step implies the dynamics of the network. After incorporating this additional dimension of information into the node sequences, the cascades carry temporal features. Given a cascade t^c, the time stamp values explicitly depict the order of the nodes. In this way, the time stamps attached to the nodes represent their global positions during the diffusion process. Consequently, a sufficient number of cascades can be used to infer the latent and global structures of the network.
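The following sketch, under the assumption of the exponential transmission-time model, combines the node sampling with time interval sampling to produce one cascade vector: the earliest arrival time is kept when a node is reached twice, and only infections inside the window [0, T^c] are recorded. The function name and parameter choices are ours, not the paper's.

```python
import random
import numpy as np

def sample_cascade(adj, seed, K, T_c, xi=1.0):
    """Diffusion sampling with exponential time intervals d(dt) = xi * exp(-xi * dt).
    Returns one cascade vector t^c with np.inf for nodes never infected in [0, T_c]."""
    t = {node: np.inf for node in adj}
    t[seed] = 0.0
    active = {seed}
    for _ in range(K - 1):
        newly = {}
        for node in active:
            neighbors = adj.get(node, [])
            if not neighbors:
                continue
            nxt = random.choice(neighbors)                        # node sampling
            arrival = t[node] + np.random.exponential(1.0 / xi)   # time sampling
            if nxt not in active:
                # if several active nodes reach nxt in this step, keep the earliest time
                newly[nxt] = min(arrival, newly.get(nxt, np.inf))
        for node, arrival in newly.items():
            t[node] = arrival
        active |= set(newly)
    # only infections observed inside the window [0, T_c] are kept
    return np.array([t[v] if t[v] <= T_c else np.inf for v in sorted(adj)])
```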
3.3. Diffusion embedding procedure

Cascades are used to store the neighborhood information as well as the node positions. The sampling procedure in the previous section explains how information is encoded into the cascades. Now we explain how to decode the information in cascades into embeddings. Formally, given |C| cascades, the diffusion embedding procedure aims to learn a function f that maps each vertex v_i of V to a vector y_i ∈ R^d, where d ≪ |V|. The mapping function is supposed to decode the neighborhood information and the node positions into the representations. We design a two-stage optimization process that solves for latent network structures based on cascades and then for low-dimensional network representations. The first component is a network inference process that results in an accurate description of the network. The second component is a matrix factorization process functioning as dimension reduction.

3.3.1. Network inference

After obtaining |C| cascades, we wish to transform them into an accurate network description that contains both global and local structure information. The problem turns into a specific optimization problem based on the network inference method [16–18]. Network inference discovers latent structures that are not explicitly observed in networks. The latent structures reveal interpretable patterns of connections. We start with the definition of the pairwise transmission likelihood of any two nodes. For an arbitrary node v_j, the probability that it is infected by an active node v_i is defined as a likelihood function f(t_j | t_i, α_{i,j}), where α_{i,j} is the information transmission rate between the two nodes and t_i, t_j (t_i < t_j) are the information arrival times of nodes v_i and v_j respectively. Taking the exponential model as an example, the parametric form of the conditional likelihood can be written as
\[ f(t_j \mid t_i, \alpha_{i,j}) = \begin{cases} \alpha_{i,j}\, e^{-\alpha_{i,j}(t_j - t_i)} & \text{if } t_i < t_j \\ 0 & \text{otherwise} \end{cases} \tag{1} \]
Furthermore, the cumulative distribution function F(t_j | t_i, α_{i,j}) can be computed from f(t_j | t_i, α_{i,j}). Then, the probability that node v_j is not infected by an already infected node v_i can be defined as S(t_j | t_i, α_{i,j}) = 1 − F(t_j | t_i, α_{i,j}). The likelihood of observing a cascade is given by

\[ f(\mathbf{t}; A) = \prod_{t_j \le T} \Big( \prod_{t_l > T} S(T \mid t_j; \alpha_{j,l}) \times \prod_{k: t_k < t_j} S(t_j \mid t_k; \alpha_{k,j}) \sum_{i: t_i < t_j} H(t_j \mid t_i; \alpha_{i,j}) \Big) \tag{2} \]
where H(t_j | t_i; α_{i,j}) = f(t_j | t_i, α_{i,j}) / S(t_j | t_i, α_{i,j}) is the instantaneous transmission rate from node v_i to node v_j. For a set of |C| independent cascades, we define the joint likelihood as

\[ \prod_{\mathbf{t}^c \in C} f(\mathbf{t}^c; A) \tag{3} \]
In all, by estimating the transmission rates α_{i,j} via maximizing the joint likelihood of the |C| independent cascades, the network inference problem reduces to

\[ \min_{A} \; -\sum_{c \in C} \log f(\mathbf{t}^c; A) + \beta_1 \|A\|_F^2 \quad \text{s.t.} \; \alpha_{i,j} \ge 0, \; i, j = 1, \cdots, N, \; i \ne j \tag{4} \]
where β_1 is a positive hyperparameter for the regularization term. Denote the optimal solution of Eq. (4) as A := {α_{i,j} | i, j = 1, ..., N, i ≠ j}. Each element of A is the transmission rate between two nodes. It serves as a weight which reflects the connection strength between each pair of nodes. Since the inference runs over the entire network, the resulting weight matrix A is able to capture the global information of the network.
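As a rough illustration of this inference step, the sketch below writes out the cascade negative log-likelihood for the exponential model of Eq. (1) and minimizes the regularized objective of Eq. (4) with an off-the-shelf bounded solver. It is a toy-scale sketch only (the paper builds on the dedicated inference methods of [16–18]); the function names are ours.

```python
import numpy as np
from scipy.optimize import minimize

def cascade_nll(alpha, times, T, eps=1e-8):
    """Negative log-likelihood of one cascade under the exponential model,
    i.e. the terms of Eq. (2) with S(t|s; a) = exp(-a (t - s)) and H = a.
    alpha: (N, N) non-negative rate matrix; times: (N,) infection times,
    with np.inf marking nodes not infected within [0, T]."""
    nll = 0.0
    infected = np.where(times <= T)[0]
    uninfected = np.where(times > T)[0]            # includes the np.inf entries
    for j in infected:
        t_j = times[j]
        # j failed to infect the nodes that stay uninfected up to T
        nll += np.sum(alpha[j, uninfected]) * (T - t_j)
        parents = infected[times[infected] < t_j]
        if len(parents) == 0:
            continue                                # the seed has no parent terms
        # earlier infected nodes did not infect j before t_j ...
        nll += np.sum(alpha[parents, j] * (t_j - times[parents]))
        # ... but at least one of them did infect j exactly at t_j
        nll -= np.log(np.sum(alpha[parents, j]) + eps)
    return nll

def infer_network(cascades, n_nodes, T, beta1=0.1):
    """Minimize the regularized objective of Eq. (4) with a bounded solver;
    a toy-scale sketch only, since all N^2 rates are optimized jointly."""
    def objective(a_flat):
        A = a_flat.reshape(n_nodes, n_nodes)
        return sum(cascade_nll(A, t, T) for t in cascades) + beta1 * np.sum(A ** 2)
    a0 = np.full(n_nodes * n_nodes, 0.1)
    res = minimize(objective, a0, bounds=[(0.0, None)] * (n_nodes * n_nodes))
    return res.x.reshape(n_nodes, n_nodes)
```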
3.3.2. Matrix factorization

The weights between nodes describe the information transmission probabilities. Even though this description is informative and complete, a compact and dense representation is still necessary for further downstream applications. Generally, the learning of low-dimensional representations can be formulated as optimizing the following cost function [44,45],

\[ O = L(X, Y) + R(Y) \tag{5} \]
where X and Y are the inputs and the representations, L is the loss function that measures the approximation of Y to X, and R is the regularization term that constrains the low-dimensional representations. Although the computation of the weight matrix differs, the learning models mainly fall into two categories according to their strategies for L. Deep models [46,47] achieve good results in some scenarios, but they contain large numbers of parameters and are not easily optimized in the network setting. Another category of methods attempts to decompose the matrix by matrix factorization techniques such as SVD [24,48]. The key notion behind the factorization framework is that the weight between any two nodes depends on the low-dimensional representations of the corresponding nodes [49]. Denote the similarity of v_i and v_j as s_G(v_i, v_j). The factorization framework attempts to optimize the loss between s_G(v_i, v_j) and the decoding function f_DEC(y_i, y_j), where y_i and y_j are the representation vectors of v_i and v_j [14]. The node vectors learned from their weights by the factorization framework are more interpretable. The work [50] has also proved that DeepWalk is equivalent to factorizing a network graph matrix. We follow the works [50,51] and use matrix factorization to factorize our similarity matrix. Eq. (5) can then be instantiated as follows,
\[ \min_{Y, W} \; O(Y, W) = \|J(A) - YW^{T}\|_F^2 + \beta_2 (\|Y\|_F^2 + \|W\|_F^2) \tag{6} \]
where the term J(A) is an accumulation function of different orders of A. Given A, optimizing O(Y, W) depends on the partial derivatives. For Y, the derivative ∂O(Y, W)/∂Y can be written as

\[ \begin{aligned} \frac{\partial O(Y, W)}{\partial Y} &= \frac{\partial \left( \|J(A) - YW^{T}\|_F^2 + \beta_2 (\|Y\|_F^2 + \|W\|_F^2) \right)}{\partial Y} \\ &= \frac{\partial \left( \mathrm{Tr}(J(A)^{T} J(A) + W Y^{T} Y W^{T} - 2 J(A)^{T} Y W^{T}) + \beta_2 (\mathrm{Tr}(Y^{T} Y) + \mathrm{Tr}(W^{T} W)) \right)}{\partial Y} \\ &= \frac{\partial \left( \mathrm{Tr}(W Y^{T} Y W^{T} - 2 J(A)^{T} Y W^{T}) + \beta_2 \mathrm{Tr}(Y^{T} Y) \right)}{\partial Y} \\ &= \frac{\partial \left( \mathrm{Tr}(Y^{T} Y W^{T} W - 2 Y W^{T} J(A)^{T}) + \beta_2 \mathrm{Tr}(Y^{T} Y) \right)}{\partial Y} \\ &= 2 (Y W^{T} W - J(A) W + \beta_2 Y) \end{aligned} \tag{7} \]

where Tr(X) is the trace of X. The first term in the fourth line follows the rule ∂Tr(X_1^T X_1 X_2)/∂X_1 = X_1 (X_2^T + X_2), and the second term in the fourth line follows the rule ∂Tr(X_1 X_2)/∂X_1 = X_2^T. Similarly, we can obtain ∂O(Y, W)/∂W as

\[ \frac{\partial O(Y, W)}{\partial W} = 2 (W Y^{T} Y - J(A)^{T} Y + \beta_2 W) \tag{8} \]
Now we focus on the selection of J(A), where J_{i,j}(A) is generally defined as [e_i (A + A^2 + ... + A^m)]_j / m, and e_i is an indicator vector that has a single non-zero value, 1, at the i-th index. The first divergence between different matrix factorization models comes from how they define the similarity between any two nodes, which leads to different options for A [14]. For instance, the similarity in [52] is defined based on Jaccard neighborhood overlaps. Other options such as [24,30] are Laplacian driven and compute their similarity matrices from the adjacency matrices. However, the matrix A inferred from cascades accumulates different aspects of network information. Another divergence is the selection between first-order similarity [30] and high-order similarity [24], which leads to different options for J. The first-order similarity has J(A) = A and considers linear neighborhood functions. High-order similarity, in contrast, corresponds to more general neighborhood overlapping functions [14]. In practice, a trade-off should be made between the lower computation cost of first-order similarity and the higher information completeness of high-order similarity. Following the works [50,51], we set J(A) = (A^2 + A)/2 in the experiments.
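A minimal sketch of this factorization step is given below: it accumulates J(A) from the first m powers of A and runs plain gradient descent with the gradients of Eq. (7) and Eq. (8). The step size, iteration count and initialization are illustrative assumptions; the paper does not prescribe them in this excerpt.

```python
import numpy as np

def factorize(A, dim, beta2=0.1, lr=1e-3, m=2, n_iter=500, seed=0):
    """Factorize J(A) = (A + A^2 + ... + A^m)/m into Y W^T by gradient descent,
    using the gradients of Eq. (7) and Eq. (8); rows of Y are the embeddings."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    J, power = np.zeros_like(A), np.eye(n)
    for _ in range(m):                       # accumulate the first m powers of A
        power = power @ A
        J += power
    J /= m
    rng = np.random.default_rng(seed)
    Y = rng.normal(scale=0.1, size=(n, dim))
    W = rng.normal(scale=0.1, size=(n, dim))
    for _ in range(n_iter):
        grad_Y = 2 * (Y @ W.T @ W - J @ W + beta2 * Y)     # Eq. (7)
        grad_W = 2 * (W @ Y.T @ Y - J.T @ Y + beta2 * W)   # Eq. (8)
        Y -= lr * grad_Y
        W -= lr * grad_W
    return Y
```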
Algorithm 1. Diffusion Based Network Embedding.
Input: graph G(V, E); diffusion steps K; representation size d; time window size T^c; number of samplings under each vertex τ; number of orders m
Output: |C| cascades {t^1, ..., t^{|C|}}; weight matrix A; network representations in matrix form Y
1: for i = 1 to τ do
2:   Π = Shuffle(V)
3:   for v_i ∈ Π do
4:     Run the node sampling process (G, v_i, K) to obtain Δ_{v_i}, U_{v_i}
5:     Run the time sampling process (Δ_{v_i}, U_{v_i}) to obtain U^t_{v_i}
6:     Formulate a cascade t^{v_i} from U^t_{v_i} with the time window T^c
7:   end for
8: end for
9: Minimize the network inference problem −∑_{c∈C} log f(t^c; A) + β_1 ||A||_F^2 to obtain the transmission matrix A
10: Calculate J(A) = (A + A^2 + ... + A^m)/m with high-order information
11: repeat
12:   Evaluate O(Y, W) = ||J(A) − YW^T||_F^2 + β_2 (||Y||_F^2 + ||W||_F^2)
13:   Update the parameters based on the partial derivatives ∂O(Y, W)/∂Y and ∂O(Y, W)/∂W
14: until convergence
15: Return representations Y
4. Algorithm and analysis

4.1. Algorithm

Solving the optimization problem in (6) results in a compact vector-wise representation of all the nodes. The complete algorithm is listed in Algorithm 1.

4.2. Time complexity analysis

Suppose the diffusion sampling procedure generates |C| cascades. For each cascade, the maximal number of active nodes is L. Then, in the worst case, the complexity of diffusion sampling is O(|C|LK). According to Eq. (4), the complexity of network inference is O(|C|(L^3(N − L))). The complexity of matrix factorization is O(N^2 d), where d is the dimension of the learned features. In all, the time complexity of the algorithm is O(N^2 d + |C|(L^3(N − L)) + |C|LK). Since |C|, d and K are constants, and L is related to the time window T^c and is often controlled to be small, our model scales quadratically with the number of nodes. The overall computational cost is acceptable.

4.3. Theoretical comparison with random walk methods

In the diffusion based model, the transmission rate is factorized as α_{i,j} = f_DEC(y_i, y_j), where y_i and y_j are the diffusion embeddings of v_i and v_j, and f_DEC is a decoding function that connects the embeddings and the node similarities. Furthermore, the conditional transmission probability based on α_{i,j} is written as f(t_j | t_i, α_{i,j}) in Eq. (1). Substituting f_DEC for α_{i,j}, f(t_j | t_i, α_{i,j}) can be written as f(t_j | t_i, f_DEC(y_i, y_j)), a function of the diffusion embeddings. Similarly, in random walk based models, the probability that v_i visits v_j is roughly defined as f(v_j | v_i) with a softmax [14],
\[ f(v_j \mid v_i) = \frac{e^{f_{DEC}(z_i, z_j)}}{\sum_{v_k \in V} e^{f_{DEC}(z_i, z_k)}} \tag{9} \]

where z_i and z_j are the random walk embeddings of v_i and v_j. The underlying assumption of the diffusion model and of random walk models is that nodes in the same node sequence (cascade or random walk) have high similarities and should have similar representations. The transmission likelihood models are used to approximate the node similarities. On the other hand, the transmission likelihood can be expressed by the decoding function of the node representations. In the optimization stage, both kinds of models attempt to maximize the co-occurrence probabilities. The difference between the two kinds of models is also straightforward. The transmission probability model f(t_j | t_i, α_{i,j}) in diffusion is constructed from two-dimensional samplings, i.e., the node neighboring sampling and the time interval sampling. However, in random walk methods, the transmission probability f(v_j | v_i) only considers node samplings. We can view random walk methods as single-trace and one-dimensional simplifications of the diffusion model.
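This correspondence can be made explicit in a few lines: both views score a node pair through a decoder over embeddings, but the diffusion view additionally conditions on the sampled infection times. The inner-product decoder used below is only one possible choice for f_DEC and is not specified by the paper.

```python
import numpy as np

def diffusion_score(y_i, y_j, t_i, t_j):
    """Diffusion view: alpha_{i,j} = f_DEC(y_i, y_j) plugged into Eq. (1);
    an inner-product decoder is assumed here for illustration."""
    alpha = np.exp(np.dot(y_i, y_j))
    return alpha * np.exp(-alpha * (t_j - t_i)) if t_i < t_j else 0.0

def random_walk_score(Z, i, j):
    """Random walk view: the softmax co-occurrence probability of Eq. (9)."""
    scores = Z @ Z[i]
    scores -= scores.max()                    # for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[j]
```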
Table 2. Statistics of datasets.
Dataset     |V|    |E|      |Label|
SYNTHETIC   1000   3876     5
WIKI        2405   17,981   17
CORA        2708   5429     7
CITESEER    3312   4732     6
DBLP        2760   3818     4

5. Experiments

In this section, we conduct experiments on synthetic and real network datasets on the node classification task to test the effectiveness of the proposed model.

5.1. Datasets

1. Synthetic Network. The synthetic network contains 1000 nodes that belong to 5 categories. The synthetic network generator is implemented by following the work [53].
2. Wiki Network [50]. The Wiki network is a collection of 2405 websites from 17 categories. There are 17,981 links between them. An edge between two pages indicates a hyperlink.
3. Citeseer Network [5,53]. Citeseer (http://citeseer.ist.psu.edu/index) is a popular scientific digital library and academic search engine. The Citeseer network is a document based citation network that consists of 3312 scientific papers. All of these publications can be classified into 6 different groups: Agents, Artificial Intelligence, Database, Human Interaction, Machine Learning and Information Retrieval. The network consists of 4732 links that describe the citation relations between different papers. The papers in the dataset are selected so that each paper cites or is cited by at least one other paper.
4. Cora Network [5,53]. The Cora network is a citation network that consists of 2708 machine learning publications classified into one of 7 classes. The network consists of 5429 links indicating the citation relations between papers.
5. DBLP Network [54]. The DBLP dataset is a co-authorship network that concentrates on computer science publications. The original DBLP dataset in [55] is used for community evaluation. Following the experiment settings in [54], we construct a four-area dataset which covers Machine Learning, Data Mining, Information Retrieval and Database. For each area, five representative conferences are selected. Authors that published papers in those conferences are selected to construct the author network.

The Cora network originally used in [5] includes both attribute information and connection information. In this paper, we remove the attribute information and only consider the 0-1 link information. The statistics of the datasets are summarized in Table 2.

5.2. Baseline methods

To evaluate the performance of our method, several state-of-the-art network embedding methods are used as baselines.

1. DeepWalk [33]. DeepWalk is a random walk based method that aims to learn low-dimensional representations for networks. The random walk serves as a network structure detector to obtain a collection of node sequences. Skip-gram is then adopted to obtain the final vector-wise representations.
2. node2vec [12]. node2vec improves DeepWalk in its random walk strategy. Specifically, node2vec extends the one-dimensional search in DeepWalk to a two-dimensional search (BFS and DFS) balanced by two hyper-parameters p and q. Identical to DeepWalk, node2vec uses skip-gram to get the node representations.
3. GraRep [24]. GraRep is a matrix factorization based graph representation method which utilizes k-step information matrices of the graph. For each single step, GraRep learns a low-dimensional embedding by imposing SVD on the matrix that encodes the given step information. The final representations with global information are concatenated from all k-step embeddings.
4. LINE [35]. LINE is designed for large-scale information networks. It attempts to preserve first-order and second-order structure information through an explicit loss function. The final representations are concatenated from the first-order and second-order information.
5. Spectral Clustering (spectral) [56]. Spectral clustering operates on the Laplacian matrix of graph G and can be utilized as a dimension reduction method. The d-dimensional representations are generated from the top d eigenvectors of the normalized Laplacian matrix.
6. Single Trace Diffusion (SingleDiff). This method is a single-trace version of diffusion embedding. The cascades are formulated by traditional random walks attached with time stamps. The single-trace cascades are fed into the diffusion embedding procedure to obtain the representations. SingleDiff keeps the node sampling strategy of DeepWalk and combines time information to detect global information.
7. Multiple Trace DeepWalk (MultiDW). This method is a multiple-trace version of DeepWalk. The node sequences are obtained by the diffusion sampling procedure. After obtaining the ordered sequences, the time stamps are eliminated. Then skip-gram is used to embed the sequences into the representations. MultiDW keeps the mapping method of DeepWalk and improves its local sampling strategy.
5.3. Node classification

We evaluate the proposed method on the task of node classification on the Synthetic, Wiki, Citeseer and Cora networks. The results are reported in Tables 3–6 respectively. The best results are highlighted in bold. Following the experimental procedure used in much of the network embedding literature, we randomly sample a portion of nodes as the training set and use the remaining nodes as the test set. The training portion is varied from 10% to 90%. To facilitate the comparison, the embedding dimension is set to d = 128 for our method and all baseline models. The low-dimensional results of all embedding models are trained with the one-vs-rest logistic regression provided in [57]. After obtaining the embeddings of all models, we run the supervised training procedure ten times and calculate the average performance in terms of both Macro-F1 and Micro-F1 for each model. Here, Macro-F1 is a metric which gives equal weight to each class, and Micro-F1 is a metric which gives equal weight to each instance. In our model, in order to grasp the network structures precisely, all vertices in the networks are treated as seeds. Specifically, for each vertex, we start the diffusion τ times to enrich the training cascades. For DeepWalk, we set the window size to 10, the number of walks for each node to 10, and the walk length to 40. For GraRep, the maximum matrix transition step kmax is set to 4.
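The evaluation protocol above can be summarized in a short sketch: a one-vs-rest logistic regression is trained on a random split of the embeddings and Micro-/Macro-F1 are averaged over repeated runs. scikit-learn is used here for convenience, whereas the paper uses the implementation of [57]; the function name and defaults are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def evaluate(Y, labels, train_ratio, runs=10, seed=0):
    """Train a one-vs-rest logistic regression on the embeddings and report
    Micro-/Macro-F1 averaged over several random splits.
    Y: (N, d) embedding matrix; labels: (N,) node classes."""
    micro, macro = [], []
    for r in range(runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            Y, labels, train_size=train_ratio, random_state=seed + r)
        clf = LogisticRegression(multi_class="ovr", max_iter=1000).fit(X_tr, y_tr)
        pred = clf.predict(X_te)
        micro.append(f1_score(y_te, pred, average="micro"))
        macro.append(f1_score(y_te, pred, average="macro"))
    return float(np.mean(micro)), float(np.mean(macro))
```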
5.3.1. Synthetic

The classification results on the Synthetic network are listed in Table 3.

Table 3. Results on Synthetic (rows: training ratio; columns: methods).
Micro-F1
Ratio  Diffusion  MultiDW  SingleDiff  DeepWalk  node2vec  GraRep  LINE    spectral
10%    0.7524     0.7248   0.5068      0.6323    0.7025    0.4720  0.7298  0.4757
20%    0.7674     0.7417   0.6217      0.6963    0.7410    0.5665  0.7502  0.5800
30%    0.7733     0.7551   0.6851      0.7202    0.7502    0.6308  0.7577  0.6274
40%    0.7874     0.7646   0.7300      0.7440    0.7570    0.6646  0.7680  0.6693
50%    0.7950     0.7682   0.7504      0.7618    0.7581    0.6924  0.7700  0.6896
60%    0.7966     0.7702   0.7655      0.7720    0.7662    0.7070  0.7804  0.7090
70%    0.8000     0.7710   0.7833      0.7790    0.7725    0.7200  0.7840  0.7153
80%    0.8100     0.7710   0.7880      0.7840    0.7727    0.7310  0.7855  0.7180
90%    0.8399     0.7740   0.7890      0.7825    0.7970    0.7500  0.7960  0.7400
Macro-F1
Ratio  Diffusion  MultiDW  SingleDiff  DeepWalk  node2vec  GraRep  LINE    spectral
10%    0.7518     0.7229   0.5011      0.6309    0.7007    0.4687  0.7286  0.4664
20%    0.7659     0.7405   0.6857      0.6960    0.7400    0.5687  0.7492  0.5842
30%    0.7728     0.7541   0.6857      0.7196    0.7491    0.6326  0.7567  0.6318
40%    0.7853     0.7633   0.7297      0.7430    0.7564    0.6660  0.7669  0.6720
50%    0.7862     0.7635   0.7505      0.7611    0.7573    0.6923  0.7692  0.6909
60%    0.7961     0.7672   0.7653      0.7714    0.7654    0.7057  0.7794  0.7096
70%    0.7991     0.7689   0.7814      0.7739    0.7713    0.7178  0.7821  0.7142
80%    0.8074     0.7693   0.7822      0.7807    0.7715    0.7287  0.7845  0.7175
90%    0.8268     0.7714   0.7864      0.7840    0.7918    0.7438  0.7955  0.7375
Bold: The best performances among all models.

Table 4. Results on Wiki (rows: training ratio; columns: methods).
Micro-F1
Ratio  Diffusion  MultiDW  SingleDiff  DeepWalk  node2vec  GraRep  LINE    spectral
10%    0.5810     0.5775   0.4240      0.4997    0.5790    0.5206  0.5782  0.5452
20%    0.6083     0.6034   0.5354      0.5456    0.6072    0.5972  0.5902  0.6020
30%    0.6191     0.6107   0.5975      0.5801    0.6136    0.6087  0.6026  0.6052
40%    0.6485     0.6374   0.6173      0.6036    0.6349    0.6138  0.6089  0.6281
50%    0.6517     0.6386   0.6256      0.6234    0.6502    0.6261  0.6119  0.6396
60%    0.6564     0.6415   0.6328      0.6379    0.6543    0.6292  0.6243  0.6444
70%    0.6621     0.6463   0.6415      0.6445    0.6553    0.6395  0.6321  0.6515
80%    0.6734     0.6494   0.6423      0.6448    0.6580    0.6501  0.6376  0.6550
90%    0.6752     0.6497   0.6589      0.6453    0.6585    0.6543  0.6490  0.6565
Macro-F1
Ratio  Diffusion  MultiDW  SingleDiff  DeepWalk  node2vec  GraRep  LINE    spectral
10%    0.4079     0.3883   0.3141      0.3815    0.4032    0.4016  0.3838  0.3631
20%    0.4473     0.4258   0.4104      0.4231    0.4409    0.4356  0.4042  0.4320
30%    0.4671     0.4442   0.4381      0.4515    0.4560    0.4526  0.4228  0.4611
40%    0.4811     0.4563   0.4486      0.4760    0.4694    0.4605  0.4302  0.4789
50%    0.4969     0.4597   0.4693      0.4865    0.4794    0.4654  0.4358  0.4931
60%    0.5084     0.4639   0.4723      0.5060    0.4826    0.4782  0.4434  0.5047
70%    0.5145     0.4797   0.4821      0.5128    0.5044    0.4884  0.4530  0.5090
80%    0.5230     0.4854   0.4885      0.5198    0.5080    0.4933  0.4636  0.5160
90%    0.5490     0.4929   0.5113      0.5454    0.5177    0.5042  0.4873  0.5230
Bold: The best performances among all models.

Table 5. Results on Citeseer (rows: training ratio; columns: methods).
Micro-F1
Ratio  Diffusion  MultiDW  SingleDiff  DeepWalk  node2vec  GraRep  LINE    spectral
10%    0.5603     0.5429   0.4982      0.4880    0.5540    0.4116  0.5002  0.4079
20%    0.5836     0.5736   0.5513      0.5350    0.5801    0.4768  0.5116  0.4865
30%    0.5940     0.5863   0.5684      0.5470    0.5911    0.5257  0.5231  0.5282
40%    0.6007     0.5933   0.5746      0.5587    0.5925    0.5558  0.5495  0.5571
50%    0.6094     0.5968   0.5812      0.5660    0.6077    0.5762  0.5780  0.5798
60%    0.6193     0.6003   0.5910      0.5702    0.6078    0.5904  0.5879  0.5914
70%    0.6237     0.6019   0.5913      0.5708    0.6103    0.6038  0.5917  0.6143
80%    0.6355     0.6061   0.5987      0.5741    0.6142    0.6120  0.5923  0.6195
90%    0.6445     0.6216   0.6018      0.5756    0.6150    0.6231  0.5996  0.6238
Macro-F1
Ratio  Diffusion  MultiDW  SingleDiff  DeepWalk  node2vec  GraRep  LINE    spectral
10%    0.4457     0.4347   0.4304      0.4107    0.4079    0.3521  0.4253  0.3540
20%    0.4866     0.4540   0.4531      0.4452    0.4473    0.4290  0.4652  0.4358
30%    0.5025     0.4835   0.4833      0.4677    0.4671    0.4780  0.4734  0.4791
40%    0.5090     0.4994   0.5024      0.4808    0.4810    0.5082  0.4793  0.5085
50%    0.5428     0.5091   0.5190      0.4930    0.4942    0.5277  0.4921  0.5338
60%    0.5614     0.5185   0.5324      0.5192    0.4973    0.5436  0.5331  0.5525
70%    0.5709     0.5222   0.5325      0.5251    0.5065    0.5595  0.5515  0.5692
80%    0.5736     0.5237   0.5352      0.5281    0.5186    0.5665  0.5534  0.5711
90%    0.5801     0.5365   0.5418      0.5241    0.5201    0.5748  0.5557  0.5770
Bold: The best performances among all models.
The proposed method obtains the best results compared with all baseline models with respect to both Micro-F1 and Macro-F1. LINE and node2vec achieve favorable results at low training rates. The main advantage of diffusion based embedding is its performance at high training rates such as 80% and 90%. Noticeably, MultiDW achieves more stable results than SingleDiff in both Micro-F1 and Macro-F1. This indicates that improving local information detection is more effective than improving global information detection in balanced networks such as the Synthetic network. MultiDW and SingleDiff are very competitive compared to the other baselines.
5.3.2. Wiki

From the results reported in Table 4, we can see that our method outperforms the baseline models in terms of both Micro-F1 and Macro-F1. Obviously, the performance gain of diffusion based embedding in Micro-F1 is more prominent. It is noticeable that spectral clustering and node2vec are very competitive, and the diffusion based embedding only obtains marginal improvements compared to the baseline models. A possible reason is that the Wiki network is a relatively dense network, whereas the diffusion based embedding focuses on discovering the latent structures of the network.
Table 6. Results on Cora (rows: training ratio; columns: methods).
Micro-F1
Ratio  Diffusion  MultiDW  SingleDiff  DeepWalk  node2vec  GraRep  LINE    spectral
10%    0.7695     0.7526   0.6338      0.6886    0.7515    0.5217  0.6763  0.5218
20%    0.7846     0.7736   0.7344      0.7243    0.7767    0.6023  0.7120  0.6101
30%    0.7893     0.7821   0.7694      0.7459    0.7790    0.6497  0.7452  0.6585
40%    0.7970     0.7891   0.7817      0.7600    0.7843    0.6841  0.7501  0.6862
50%    0.8094     0.7966   0.7909      0.7684    0.7966    0.7086  0.7673  0.7058
60%    0.8136     0.7979   0.7957      0.7675    0.8031    0.7288  0.7699  0.7164
70%    0.8173     0.8025   0.7963      0.7817    0.8044    0.7415  0.7732  0.7271
80%    0.8191     0.8044   0.7981      0.8059    0.8062    0.7418  0.7767  0.7335
90%    0.8265     0.8068   0.8081      0.8084    0.8118    0.7413  0.7780  0.7417
Macro-F1
Ratio  Diffusion  MultiDW  SingleDiff  DeepWalk  node2vec  GraRep  LINE    spectral
10%    0.7586     0.7394   0.6016      0.6683    0.7433    0.4647  0.6579  0.4643
20%    0.7769     0.7613   0.7142      0.7085    0.7604    0.5765  0.6840  0.5869
30%    0.7881     0.7719   0.7546      0.7302    0.7621    0.6349  0.7162  0.6452
40%    0.7924     0.7784   0.7714      0.7447    0.7771    0.6712  0.7262  0.6768
50%    0.7991     0.7794   0.7828      0.7536    0.7842    0.6981  0.7293  0.6975
60%    0.8007     0.7821   0.7860      0.7686    0.7897    0.7219  0.7368  0.7082
70%    0.8075     0.7832   0.7863      0.7718    0.7927    0.7333  0.7494  0.7166
80%    0.8097     0.7915   0.7885      0.7962    0.7961    0.7361  0.7503  0.7254
90%    0.8114     0.7978   0.7897      0.7996    0.8012    0.7370  0.7506  0.7423
Bold: The best performances among all models.
The denser the network is, the fewer invisible structures exist. As a result, the diffusion based model has no particular superiority over some baseline models. For MultiDW, we observe that the diffusion node samplings improve the performance at low training rates. The same result can be seen for node2vec and LINE, which also attempt to improve the local information samplings. For SingleDiff, the inclusion of global information increases the performance at high training rates. The above observations demonstrate that the local information is important to the stability of the performance, and the global information can help improve the best performance.

5.3.3. Citeseer

As indicated in Table 5, our method outperforms the other baseline models in terms of both Macro-F1 and Micro-F1 scores. Compared with DeepWalk and GraRep, our method achieves stable results when varying the percentage of labeled nodes. Compared with node2vec and LINE, our method obtains significant improvements when the training rate exceeds 80%. SingleDiff and MultiDW obtain different results in Macro-F1 and Micro-F1, which indicates that the combination of local information and global information is more effective.

5.3.4. Cora

We can observe from Table 6 that our method achieves better performance in terms of both Macro-F1 and Micro-F1 measures. Except for node2vec, the diffusion embedding has distinct advantages over the baseline models. Both the diffusion based embedding and node2vec are stable with respect to all training rates. Compared with node2vec, our model performs better by at least 1% at all training rates. Similarly, the MultiDW model is more stable than the SingleDiff model.

5.4. A Case Study in DBLP

In this subsection, we conduct a case study on the DBLP data to test the robustness of our method in a highly biased network. In this paper, a highly biased network is defined based on the degree distribution of the nodes. The overall connections of a biased network are very sparse except for a few high-degree nodes. Those exceptional nodes are considered important nodes which have many more neighbors than other nodes. The random walk in graph sampling [15] is easily biased towards such high-degree nodes. In our experiment, we construct a biased network from DBLP in which the structures are highly unbalanced. As illustrated in Fig. 3, about 90% of the nodes have five or fewer connections.
However, several nodes have extremely high degrees. We compare the proposed diffusion model with DeepWalk and node2vec on the classification task to see whether the biased structure affects the performance. The variations of Micro-F1 and Macro-F1 with the percentage of labeled nodes are plotted in Fig. 4. The results show that in this highly unbalanced network, the diffusion model is less affected by the biased structures than DeepWalk and node2vec. Even though node2vec revises the random walks used in DeepWalk, it also suffers performance loss in the DBLP network. This is probably because, by using more local information and combining global information, our model can eliminate the impact of biased structures to some extent.

5.5. Parameters and analysis

In this section, we compare our method with DeepWalk to analyze the parameter sensitivity of our model. As described in the previous sections, the time window T^c is the time length we choose to observe in the diffusion process, and the number of cascades τ for each node is related to the total amount of training corpus. Since a cascade without time stamps is very similar to a random walk sequence, we match T^c and τ with the walk length T_w and the number of walks per node γ in DeepWalk, respectively. Stochastic methods need plenty of instances to simulate the true distribution. Consequently, they more easily suffer performance loss due to insufficient training samples. In network embedding, graph sampling methods depend heavily on the number of samplings, since more sequences are more likely to detect the true structures of the network. At the very beginning, we consider an extreme case in which the number of cascades (walks) for each node is 1 and the time window (walk length) is 10. We plot the Macro-F1 and Micro-F1 for the Wiki, Cora and Citeseer networks. As illustrated in Fig. 5, our model achieves better performance than DeepWalk. Even with a small number of cascades, the diffusion based embedding is capable of detecting valid network structures. One explanation of this phenomenon is that methods such as DeepWalk focus on local structure detection and therefore require repeated samplings on each node to guarantee that the network structures are captured adequately. However, the diffusion based model considers global information and devotes its effort to recovering the latent structures. Therefore, it is more robust to the sampling frequency at each vertex. In the second case, the time window (walk length) is set to 40, and the number of cascades (walks) for each node is varied from 1 up to 80.
Fig. 3. The degree histogram and network structure of the DBLP network.
Fig. 4. The classification results in highly biased DBLP network.
We record Macro-F1 and Micro-F1 on the Cora and Citeseer datasets with the training set ratio fixed at 0.8. Fig. 6 shows that our model performs better than DeepWalk and achieves more stable performance when the number of cascades exceeds 10. Finally, we report the performance on the Citeseer network when varying the dimension of the learned representations. We choose the sampling ratios (0.1, 0.2, 0.5, 0.9) as examples to report the classification results. As shown in Fig. 7, the results improve significantly when the dimension is less than 128 and become stable afterwards.
5.6. Scalability

The running time of the diffusion sampling is related to the number of nodes. For the node classification task, the maximal and minimal sampling times occur on the Citeseer network and the Synthetic network respectively. The maximal running time is on the order of 10^1 seconds. To examine our sampling process in a more general way, we test the scalability of the proposed model on random regular graphs by varying their sizes from 100 to 1,000,000 with default parameter settings. The random regular graphs specify the degree of each node; we set the degree of each node to 10.
Fig. 5. An extreme example when the number of samplings on each node is 1.
Fig. 6. The comparison results of DeepWalk and Diffusion when varying the number of samplings for each node (Training ratio: 0.8).
Fig. 7. The classification performance on Citeseer with different representation sizes of the learned representations: T^c = 10, τ = 1.
Fig. 8. The scalability of diffusion based network embedding in random regular graphs.

The network inference used in [17] can be easily applied to small graphs, but requires computer clusters to obtain results in a reasonable time for large graphs. To this end, we only report the scalability of our sampling step. As shown in Fig. 8, the sampling phase of our model has linear scalability. Note that our method has an extra time sampling procedure and includes more nodes in the node sampling procedure. Nevertheless, the sampling time still remains less than an hour for a graph with 1,000,000 nodes. The modification of the sampling strategy brings no additional time cost and inherits the efficient sampling of previous random walk methods [12,33].

6. Conclusion

In this paper, we studied stochastic network embedding as an optimization problem based on latent structure searching. Compared with previous stochastic models, our diffusion based network embedding takes into account high-level latent structures. This perspective indicates which searching strategy will generate useful and comprehensive information. For previous stochastic models, the latent structures are the node proximity within the same node sequence generated by random walk samplings. They only consider shallow latent structures, also known as the first-order and second-order proximity defined in LINE [35]. In contrast, for the diffusion based model, the latent structures are revealed in the information cascades generated by diffusion samplings. The network inference returns accurate weights of node pairs and extends the positive-negative sampling strategy to a continuous and accurate form. For example, we observe that random walk sampling depends on streaming a large number of random walks to offset the loss of information. This makes the diffusion process more suitable for detecting local information. Our analysis of parameters showed that, with only a few samplings, the diffusion model is still competitive. Additionally, the network inference is a global process which searches the latent structures over the whole network. The case study on DBLP illustrated that this global process can reduce the bias in highly unbalanced networks. The results also show that, for dense networks such as Wiki, the diffusion embedding has only marginal advantages over the baselines. As future work, it is worth improving the sparsity of the weight matrix by introducing new norms into the loss function of the network inference. Further extensions of the proposed method mainly rely on reducing the computation cost of the diffusion embedding procedure. For example, encoding the cascades into the representations as an end-to-end process would relieve the computation issue in the network inference step. The weight matrix can be considered as an indicator of the importance of neighbors, and we can use it as a sampling tool in deep graph models to reduce the computation cost.

Acknowledgment

This work was supported by the National Natural Science Foundation of China [Grant No. 11331012, No. 91546201, No. 71331005, No. 71110107026, No. 11671379] and the University of Chinese Academy of Sciences Grant [No. Y55202LY00].

References
[1] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, S. Lin, Graph embedding and extensions: a general framework for dimensionality reduction, IEEE Trans. Pattern Anal. Mach. Intell. 29 (1) (2007) 40–51. [2] S. Chang, W. Han, J. Tang, G.-J. Qi, C.C. Aggarwal, T.S. Huang, Heterogeneous network embedding via deep architectures, in: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2015, pp. 119–128. [3] D. Wang, P. Cui, W. Zhu, Structural deep network embedding, in: Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2016, pp. 1225–1234. [4] L.F. Ribeiro, P.H. Saverese, D.R. Figueiredo, struc2vec: Learning node representations from structural identity, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2017, pp. 385–394. [5] Q. Lu, L. Getoor, Link-based classification, in: Proceedings of the 20th International Conference on Machine Learning (ICML-03), 2003, pp. 496–503. [6] J. Gibert, E. Valveny, H. Bunke, Graph embedding in vector spaces by node attribute statistics, Pattern Recognit 45 (9) (2012) 3072–3083. [7] L. Akoglu, M. McGlohon, C. Faloutsos, Oddball: spotting anomalies in weighted graphs, Adv. Knowl. Discovery Data Min. (2010) 410–421. [8] Y. Xie, M. Gong, S. Wang, B. Yu, Community discovery in networks with deep sparse filtering, Pattern Recognit. 81 (2018) 50–59. [9] A. Reihanian, M.-R. Feizi-Derakhshi, H.S. Aghdasi, Overlapping community detection in rating-based social networks through analyzing topics, ratings and links, Pattern Recognit. 81 (2018) 370–387. [10] T. Wang, Y. Chen, Z. Zhang, T. Xu, L. Jin, P. Hui, B. Deng, X. Li, Understanding graph sampling algorithms for social network analysis, in: Distributed Computing Systems Workshops (ICDCSW), 2011 31st International Conference on, IEEE, 2011, pp. 123–128. [11] M. De Choudhury, Y.-R. Lin, H. Sundaram, K.S. Candan, L. Xie, A. Kelliher, et al., How does the data sampling strategy impact the discovery of information diffusion in social media? ICWSM 10 (2010) 34–41. [12] A. Grover, J. Leskovec, node2vec: Scalable feature learning for networks, in: Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2016, pp. 855–864. [13] T. Mikolov, K. Chen, G.S. Corrado, J. Dean, Efficient Estimation of Word Representations in Vector Space, CoRR abs/1301 (2013) 3781. [14] W.L. Hamilton, R. Ying, J. Leskovec, Representation Learning on Graphs: Methods and Applications, IEEE Data Eng. Bull 40 (2017) 52–74. [15] M. Gjoka, M. Kurant, C.T. Butts, A. Markopoulou, Walking in facebook: A case study of unbiased sampling of osns, in: Infocom, 2010 Proceedings IEEE, IEEE, 2010, pp. 1–9. [16] M. Gomez Rodriguez, J. Leskovec, A. Krause, Inferring networks of diffusion and influence, in: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2010, pp. 1019–1028. [17] M. Gomez Rodriguez, D. Balduzzi, B. Schölkopf, G.T. Scheffer, et al., Uncovering the Temporal Dynamics of Diffusion Networks, in: 28th International Conference on Machine Learning (ICML 2011), International Machine Learning Society, 2011, pp. 561–568. [18] S. Myers, J. Leskovec, On the convexity of latent social network inference, in: Advances in neural information processing systems, 2010, pp. 1741–1749. [19] Y. Zhang, T. Lyu, Y. 
Zhang, Cosine: Community-preserving social network embedding from information diffusion cascades., AAAI, 2018. [20] M. Belkin, P. Niyogi, Laplacian eigenmaps and spectral techniques for embedding and clustering, in: NIPS, 14, 2001, pp. 585–591. [21] S.T. Roweis, L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (5500) (2000) 2323–2326. [22] J.B. Tenenbaum, V. De Silva, J.C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (5500) (2000) 2319–2323. [23] M.M. Luqman, J.-Y. Ramel, J. Lladós, T. Brouard, Fuzzy multilevel graph embedding, Pattern Recognit. 46 (2) (2013) 551–565. [24] S. Cao, W. Lu, Q. Xu, Grarep: Learning graph representations with global structural information, in: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, ACM, 2015, pp. 891–900. [25] A. Robles-Kelly, E.R. Hancock, A riemannian approach to graph embedding, Pattern Recognit. 40 (3) (2007) 1042–1056.
[26] A. Maronidis, A. Tefas, I. Pitas, Subclass graph embedding and a marginal fisher analysis paradigm, Pattern Recognit. 48 (12) (2015) 4024–4035. [27] S.F. Mousavi, M. Safayani, A. Mirzaei, H. Bahonar, Hierarchical graph embedding in vector space by graph pyramid, Pattern Recognit. 61 (2017) 245–254. [28] B. Luo, R.C. Wilson, E.R. Hancock, Spectral embedding of graphs, Pattern Recognit. 36 (10) (2003) 2213–2230. [29] E.Z. Borzeshi, M. Piccardi, K. Riesen, H. Bunke, Discriminative prototype selection methods for graph embedding, Pattern Recognit. 46 (6) (2013) 1648–1657. [30] A. Ahmed, N. Shervashidze, S. Narayanamurthy, V. Josifovski, A.J. Smola, Distributed large-scale natural graph factorization, in: Proceedings of the 22nd international conference on World Wide Web, ACM, 2013, pp. 37–48. [31] X. Bai, C. Yan, H. Yang, L. Bai, J. Zhou, E.R. Hancock, Adaptive hash retrieval with kernel based similarity, Pattern Recognit. 75 (2018) 136–148. [32] X. Bai, H. Yang, J. Zhou, P. Ren, J. Cheng, Data-dependent hashing based on p-stable distribution, IEEE Trans. Image Process. 23 (12) (2014) 5033–5046. [33] B. Perozzi, R. Al-Rfou, S. Skiena, Deepwalk: Online learning of social representations, in: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2014, pp. 701–710. [34] T. Lyu, Y. Zhang, Y. Zhang, Enhancing the network embedding quality with structural similarity, in: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, ACM, 2017, pp. 147–156. [35] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, Q. Mei, Line: Large-scale information network embedding, in: Proceedings of the 24th International Conference on World Wide Web, International World Wide Web Conferences Steering Committee, 2015, pp. 1067–1077. [36] K. Itô, H.P. McKean Jr., Diffusion processes and their sample paths, 1965. [37] M.E. Newman, Spread of epidemic disease on networks, Phys. Rev. E 66 (1) (2002) 016128. [38] D.J. Watts, P.S. Dodds, Influentials, networks, and public opinion formation, J. Consum. Res. 34 (4) (2007) 441–458. [39] E. Abrahamson, L. Rosenkopf, Social network effects on the extent of innovation diffusion: a computer simulation, Organ. Sci. 8 (3) (1997) 289–309. [40] D. Kempe, J. Kleinberg, É. Tardos, Maximizing the spread of influence through a social network, in: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2003, pp. 137–146. [41] J. Leskovec, M. McGlohon, C. Faloutsos, N. Glance, M. Hurst, Patterns of cascading behavior in large blog graphs, in: Proceedings of the 2007 SIAM international conference on data mining, SIAM, 2007, pp. 551–556. [42] J. Leskovec, A. Singh, J. Kleinberg, Patterns of influence in a recommendation network, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, 2006, pp. 380–389. [43] J. Leskovec, L.A. Adamic, B.A. Huberman, The dynamics of viral marketing, ACM Trans. Web (TWEB) 1 (1) (2007) 5. [44] K. Jia, L. Sun, S. Gao, Z. Song, B.E. Shi, Laplacian auto-encoders: an explicit learning of nonlinear data manifold, Neurocomputing 160 (2015) 250–260. [45] Y. Liao, Y. Wang, Y. Liu, Image representation learning using graph regularized auto-encoders, CoRR abs/1312.0786 (2013). [46] F. Tian, B. Gao, Q. Cui, E. Chen, T.-Y. Liu, Learning deep representations for graph clustering., in: AAAI, 2014, pp. 1293–1299. [47] S. Cao, W. Lu, Q.
Xu, Deep neural networks for learning graph representations., in: AAAI, 2016, pp. 1145–1152.
[48] O. Levy, Y. Goldberg, Neural word embedding as implicit matrix factorization, in: Advances in neural information processing systems, 2014, pp. 2177–2185. [49] Y. Koren, R. Bell, C. Volinsky, Matrix factorization techniques for recommender systems, Computer 42 (8) (2009) 30–37, doi:10.1109/MC.2009.263. [50] C. Yang, Z. Liu, D. Zhao, M. Sun, E.Y. Chang, Network representation learning with rich text information., in: IJCAI, 2015, pp. 2111–2117. [51] C. Tu, W. Zhang, Z. Liu, M. Sun, Max-margin deepwalk: Discriminative learning of network representation., in: IJCAI, 2016, pp. 3889–3895. [52] M. Ou, P. Cui, J. Pei, Z. Zhang, W. Zhu, Asymmetric transitivity preserving graph embedding, in: Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2016, pp. 1105–1114. [53] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, T. Eliassi-Rad, Collective classification in network data, AI Mag. 29 (3) (2008) 93. [54] Y. Sun, Y. Yu, J. Han, Ranking-based clustering of heterogeneous information networks with star network schema, in: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2009, pp. 797–806. [55] J. Yang, J. Leskovec, Defining and evaluating network communities based on ground-truth, Knowl. Inf. Syst. 42 (1) (2015) 181–213. [56] A.Y. Ng, M.I. Jordan, Y. Weiss, On spectral clustering: analysis and an algorithm, in: Advances in neural information processing systems, 2002, pp. 849–856. [57] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, C.-J. Lin, Liblinear: a library for large linear classification, J. Mach. Learn. Res. 9 (Aug) (2008) 1871–1874.

Yong Shi received the B.S. degree in mathematics from Southwest Petroleum Institute, Chengdu, China, in 1982, and the Ph.D. degree in management science from the University of Kansas, Lawrence, KS, USA, in 1991. He is currently the Director of the Key Laboratory of Big Data Mining and Knowledge Management and the Director of the Research Center on Fictitious Economy & Data Science, Chinese Academy of Sciences, Beijing, China. His current research interests include data mining and multiple criteria decision making.

Minglong Lei received his M.S. degree in computer science and technology from The 6th Research Institute of China Electronics Corporation, Beijing, China, in 2016. He is currently pursuing the Ph.D. degree at the Chinese Academy of Sciences, Beijing, China. His current research interests include machine learning and data mining.

Hong Yang received her bachelor's degree from Xidian University and her master's degree from the Chinese Academy of Sciences, both in Communications and Information Systems. She then worked at MathWorks as an algorithm and software training engineer. She is currently a Ph.D. student at the Centre for Artificial Intelligence, University of Technology Sydney, Australia. Her research interests include data analytics and machine learning.

Lingfeng Niu received the B.S. degree in mathematics from Xi'an Jiaotong University, Xi'an, China, in 2004, and the Ph.D. degree in mathematics from the Chinese Academy of Sciences, Beijing, China, in 2009. She has been an Associate Professor with the Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, since 2009. Her current research interests include optimization and machine learning.