Variance reduction in large graph sampling


Information Processing and Management 50 (2014) 476–491

Contents lists available at ScienceDirect

Information Processing and Management journal homepage: www.elsevier.com/locate/infoproman

Jianguo Lu ⇑, Hao Wang

School of Computer Science, University of Windsor, 401 Sunset Avenue, Windsor, Ontario N9B 3P4, Canada

Article info

Article history: Received 25 March 2013; received in revised form 7 January 2014; accepted 11 February 2014; available online 15 March 2014.

Keywords: Uniform random sampling; Random walk; Graph sampling; Online social network; Scale-free network; Harmonic mean

Abstract

The norm of practice in estimating graph properties is to use uniform random node (RN) samples whenever possible. Many graphs are large and scale-free, inducing large degree variance and estimator variance. This paper shows that random edge (RE) sampling and the corresponding harmonic mean estimator for average degree can reduce the estimation variance significantly. First, we demonstrate that the degree variance, and consequently the variance of the RN estimator, can grow almost linearly with data size for typical scale-free graphs. Then we prove that the RE estimator has a variance bounded from above. Therefore, the variance ratio between RN and RE sampling can be very large for big data. The analytical result is supported by both simulation studies and 18 real networks. We observe that the variance reduction ratio can be more than a hundred for some real networks such as Twitter. Furthermore, we show that random walk (RW) sampling is always worse than RE sampling, and it can reduce the variance of the RN method only when its performance is close to that of RE sampling.

Crown Copyright © 2014 Published by Elsevier Ltd. All rights reserved.

1. Introduction

The data on the Web or in online social networks can often be viewed as a graph. The graph in its entirety may not be available for various reasons. It can be distributed over many machines (e.g., the Web and P2P networks), hidden behind searchable interfaces (e.g., search engines and online social networks), or scattered within a larger graph (e.g., various communities in online social networks). Regardless of the cause, a common challenge is to reveal the properties of such graphs when we do not own the entire data.

In the past, extensive research was carried out to explore the profile of search engines (Lawrence & Giles, 1998) and other data collections (Broder, 2006; Callan & Connell, 2001; Si & Callan, 2003). Most of this work focused on obtaining uniform random node (RN) samples, such as uniform random web pages from the Web (Henzinger, Heydon, Mitzenmacher, & Najork, 2000) and search engines (Bar-Yossef & Gurevich, 2008), and uniform random bloggers from online social networks (Gjoka, Kurant, Butts, & Markopoulou, 2009). Once uniform random samples are obtained, network properties, in particular node attributes such as the average degree, can be estimated with statistical guarantees.

In many cases, RN sampling works only in theory. The majority of real-world networks are scale-free (Barabási & Albert, 1999), with degree distributions that follow a power law. Such scale-free networks often induce a large variance of the degrees. In theory, the variance does not exist when the exponent of the power law falls in a certain range. In practice, the variance can be extremely high for very large networks. For instance, the coefficient of variation of the Twitter user network collected in 2009 (Kwak, Lee, Park, & Moon, 2010) is as high as 35.95. To understand the impact of such a high coefficient of variation, let us quickly calculate the sample size needed to reach 20% accuracy for its average degree of 70.51.
More precisely, to make sure that the estimate is within the range $70.51 \pm 14.10$ with 95% confidence, the relative standard error (RSE) should be around 0.1, and the required sample size is $n = 35.95^2/(0.1)^2 \approx 129{,}240$.

In addition, uniform random samples are costly to obtain because they are not provided directly by the data sources. Costly sampling methods, such as rejection sampling, have to be employed to obtain uniform samples. In the process, many samples are retrieved and rejected as invalid. The number of samples actually retrieved is many times larger than 129,240, depending on the sampling methods allowed. Considering the network traffic involved and the daily quota imposed by the service provider, it is prohibitive to use uniform random sampling to obtain meaningful estimates. With more and more applications of big data analysis, there is an urgent need to find a method to reduce such a large variance.

Recent studies observed empirically that simple random walk (RW) sampling or its extensions can improve degree estimator performance for P2P networks (Rasti, 2009), the Facebook user network (Gjoka et al., 2009), the Twitter user network (Lu & Li, 2012), and term-document bipartite graphs (Wang, Liang, & Lu, in press). Similar empirical observations have been made for node size estimation (Katzir, Liberty, & Somekh, 2011; Kurant, Butts, & Markopoulou, 2012). We find that these observations are data dependent. Random walk can be much worse than uniform random sampling for other datasets, even when the graph is scale-free and the variance is very high, as we will show in Section 4.3. We find that it is random edge (RE) sampling, not RW sampling, that reduces the variance for graphs with large degree variation. In addition to this empirical observation, we explore the reason why RE outperforms RN sampling, and why RW does not.

⇑ Corresponding author: J. Lu. Tel.: +1 519 253 3000. http://dx.doi.org/10.1016/j.ipm.2014.02.003
While it is easy to understand that uniform random sampling does not work well for scale-free networks, it was not clear whether RE sampling works better. This paper shows that the variance of the RE estimator is bounded from above by a polynomial in the average degree and sample size. This implies that the performance of RE sampling does not deteriorate as the degree variance of the graph grows, and thereby guarantees the superiority of RE sampling when the degree variance is large. This result is particularly important for large graphs, whose variance is larger than that of smaller data with the same distribution. Improvement ratios as high as 100 are observed on Twitter and other networks.

Such a large gap has implications for both practitioners and researchers. Practitioners can greatly reduce the estimation effort and give a worst-case error bound. Researchers can devise new sampling methods that approximate RE sampling when it is not directly supported by the data source under investigation. For instance, random walk with restart (Avrachenkov, Ribeiro, & Towsley, 2010) can succeed because it is similar to RE sampling and exploits the large gap between RN and RE sampling.

The major contribution of the paper is our development of the upper bound of the variance of the RE estimator. The result holds independent of degree distribution and graph topology. We verify the result using both simulated datasets and 18 real-world networks. A direct consequence of the upper bound is the improvement ratio between the RN and RE methods. To illustrate that the ratio can be very large, we first demonstrate that the degree variance (and consequently the RN estimator variance) can be on the order of $O(N/\ln^2 N)$ under the assumption of a power law distribution. Since the RE variance is bounded from above, the improvement ratio tends to infinity as the data size grows without bound.
Finally, we show that random walk sampling can approximate the performance of RE sampling only when the conductance of the graph is not very small, that is, when the graph is well-enmeshed.

In the following sections, we first introduce the background of the research in Section 2, including the sampling methods and their corresponding estimators, the related work, and its applications. Then in Section 3 we derive the variances of the RN and RE estimators. By giving the upper bound of the variance of the RE estimator, we quantify the performance ratio between the RN and RE methods. In Section 4, we verify our result on 18 real networks, and demonstrate that the performance of RW sampling depends on both degree variance and graph conductance.

2. Background and related work

2.1. RN, RE, and RW sampling

Consider an undirected graph $G(V, E)$, where $V$ is the set of nodes and $E$ the set of edges. Let $|V| = N$. Nodes are labeled $1, 2, \ldots, N$, and their corresponding degrees are $d_1, d_2, \ldots, d_N$. The volume of the graph is $s = \sum_{i=1}^{N} d_i$, and the average degree is $\langle d \rangle = \frac{1}{N}\sum_{i=1}^{N} d_i = s/N$. The variance $\sigma^2$ of the degrees in the population is defined as:

$$\sigma^2 = \langle d^2 \rangle - \langle d \rangle^2, \tag{1}$$

where $\langle d^2 \rangle = \sum_{i=1}^{N} d_i^2 / N$ is the second moment, i.e., the arithmetic mean of the squared degrees in the total population. The coefficient of variation (denoted $\gamma$) is defined as the standard deviation, i.e., the square root of the variance, normalized by the mean of the degrees:

$$\gamma^2 = \frac{\sigma^2}{\langle d \rangle^2} = \frac{\langle d^2 \rangle}{\langle d \rangle^2} - 1. \tag{2}$$

Suppose that a sample of $n$ elements $(d_{x_1}, \ldots, d_{x_n})$ is taken from the population, where $x_i \in \{1, 2, \ldots, N\}$ for $i = 1, 2, \ldots, n$. Our task is to estimate the average degree $\langle d \rangle$ using the sample. Table 1 summarizes the notations used in this paper.

Table 1
Summary of notations.

| Notation | Meaning | Properties |
|---|---|---|
| $N$ | Population size | |
| $n$ | Sample size | |
| $d_i$ | Degree of node $i$ | |
| $s$ | Volume of all the nodes | $s = \sum_{i=1}^{N} d_i = N\langle d \rangle$ |
| $d_{x_j}$ | Degree of the $j$-th sampled node | $x_j \in \{1, 2, \ldots, N\}$ |
| $p_i$ | Probability of node $i$ being visited | $p_i = d_i/s$, $\sum_{i=1}^{N} p_i = 1$ |
| $\langle d \rangle$ | Mean degree | $\langle d \rangle = s/N$ |
| $\langle d^2 \rangle$ | Mean of the squared degrees | $\langle d^2 \rangle = \sum_{i=1}^{N} d_i^2 / N$ |
| $\sigma^2$ | Variance of the degrees | $\sigma^2 = \langle d^2 \rangle - \langle d \rangle^2$ |
| $\gamma$ | Coefficient of variation | $\gamma^2 = \sigma^2/\langle d \rangle^2 = \langle d^2 \rangle/\langle d \rangle^2 - 1$ |
| $\langle d_E \rangle$ | Asymptotic mean degree of RE sampling | $\langle d_E \rangle = \langle d^2 \rangle/\langle d \rangle$ |

There are different ways to take the samples, notably RN, RE, and RW sampling. In RN sampling, each node is sampled uniformly at random with replacement. In RE sampling, edges are selected with equal probability and the two nodes incident to a random edge are collected. In this way, RE sampling is a kind of PPS (probability proportional to size) sampling in that each node is sampled with probability proportional to its degree. RW sampling selects the next node uniformly at random from the neighbourhood of the current node. Its node selection probability is asymptotically proportional to the degree.

Different sampling methods require different estimators. The arithmetic mean is an unbiased estimator for RN sampling:

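$$\widehat{\langle d \rangle}_{RN} = \frac{1}{n}\sum_{i=1}^{n} d_{x_i}. \tag{3}$$

In RE or PPS sampling, the arithmetic mean estimator tends to overestimate the average degree $\langle d \rangle$ by a factor of $(\gamma^2 + 1)$. Instead, the harmonic mean should be used for these samples:

$$\widehat{\langle d \rangle}_{RE} = n\left[\sum_{i=1}^{n} \frac{1}{d_{x_i}}\right]^{-1}. \tag{4}$$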

We refer to Salganik and Heckathorn (2004) for the detailed derivation of this estimator. RW sampling assumes that the sampling probability is proportional to the degree, and therefore the same estimator is used. Our interest in this paper is the variance of the RE estimator, particularly in comparison with the variance of the RN estimator.

The sampling and estimation methods can be illustrated using Fig. 1. The average degree of the graph is 2. The sample degrees taken by the RN, RE, and RW sampling methods are (1, 1, 1, 1, 2, 8), (1, 8, 1, 8, 2, 4), and (4, 3, 8, 1, 8, 1), respectively. The estimates for the RN, RE, and RW samples are:

$$\widehat{\langle d \rangle}_{RN} = \frac{1 + 1 + 1 + 1 + 2 + 8}{6} \approx 2.33,$$

$$\widehat{\langle d \rangle}_{RE} = \frac{6}{\frac{1}{1} + \frac{1}{8} + \frac{1}{1} + \frac{1}{8} + \frac{1}{2} + \frac{1}{4}} = 2,$$

$$\widehat{\langle d \rangle}_{RW} = \frac{6}{\frac{1}{4} + \frac{1}{3} + \frac{1}{8} + \frac{1}{1} + \frac{1}{8} + \frac{1}{1}} \approx 2.12.$$

[Figure 1 panels: Original graph, Random Node, Random Edge, Random Walk.]

Fig. 1. A graph and three sampling methods to select six sample nodes. The three sampling methods are random node (RN), random edge (RE), and random walk (RW). Nodes can be sampled multiple times as shown in sub-figures for RE and RW samplings.
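The estimates for the Fig. 1 samples can be reproduced with a few lines of Python. This is a minimal sketch: the function names `rn_estimate` and `re_estimate` are illustrative choices, not from the paper, and the sample lists are copied from the text above.

```python
from statistics import fmean

# Sample degrees quoted in the text for Fig. 1 (six nodes per method).
rn_sample = [1, 1, 1, 1, 2, 8]  # random node
re_sample = [1, 8, 1, 8, 2, 4]  # random edge
rw_sample = [4, 3, 8, 1, 8, 1]  # random walk


def rn_estimate(degrees):
    """Arithmetic mean, Eq. (3): for uniformly sampled nodes."""
    return fmean(degrees)


def re_estimate(degrees):
    """Harmonic mean, Eq. (4): n / sum(1/d) for degree-proportional samples."""
    return len(degrees) / sum(1.0 / d for d in degrees)


est_rn = rn_estimate(rn_sample)
est_re = re_estimate(re_sample)
est_rw = re_estimate(rw_sample)  # RW reuses the harmonic mean estimator
print(f"RN: {est_rn:.3f}  RE: {est_re:.3f}  RW: {est_rw:.3f}")
```

Applying the arithmetic mean to the RE sample instead would give 24/6 = 4, twice the true average degree, which is the overestimation that the harmonic mean corrects.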


To develop intuition about the high variance of RN sampling, and the variance reduction enabled by RE sampling to be discussed in the next section, consider the pedagogical example depicted in Fig. 2. It is a star graph that has one large node connected to every other node (degree $= N - 1$), while each of the remaining $N - 1$ nodes connects to the large node only (degree $= 1$). Such a graph, on a much larger scale, is also found as a subgraph in the real NotreDame web graph, as shown in Fig. 11. The average degree is $(N - 1 + N - 1)/N \approx 2$, assuming $1/N \approx 0$. Most of the uniform random samples will include the small nodes only, even when the sample size is close to $N$. Thus most of the estimates will be 1, while occasionally there are very large estimates when the large node is sampled.

When RE sampling is used, both small and large nodes are sampled, resulting in the sampled degree sequence $(1, N - 1, 1, N - 1, \ldots)$. For these sampled degrees, the sample mean is $N/2$, which grossly overestimates the average degree because a node is sampled with probability proportional to its degree. Such samples need a different estimator, i.e., the harmonic mean instead of the arithmetic mean. The harmonic mean of four sample degrees is $4/(1 + 1/(N - 1) + 1 + 1/(N - 1)) \approx 2$. This approximates the true value very well.

2.2. Pertinent work

Graph sampling has been widely studied (Leskovec & Faloutsos, 2006; Wang et al., 2011), and finds its applications in online social networks (Gjoka et al., 2009; Papagelis, Das, & Koudas, 2013; Dasgupta, Kumar, & Sivakumar, 2012; Ribeiro & Towsley, 2010), real social networks (Salganik & Heckathorn, 2004; Wejnert & Heckathorn, 2008), web graphs (Henzinger et al., 2000), search engine indexes (Bar-Yossef & Gurevich, 2008), and deep web data sources (Lu & Li, 2010). The norm of practice is to use uniform random samples whenever possible. Quite often, uniform random sampling is not directly supported.
Other methods, such as Metropolis-Hastings Random Walk (MHRW) (Metropolis, Rosenbluth, Rosenbluth, Teller, & Teller, 1953) and rejection sampling, are utilized to approximate uniform random sampling (Bar-Yossef & Gurevich, 2008). When uniform random samples are not available, numerous sampling methods have been proposed, in particular RW (Lovász, 1993) for unequal probability sampling.

Recently, there have been empirical observations that RW sampling can outperform MHRW sampling (Rasti, 2009; Gjoka et al., 2009). Although MHRW does produce uniform random samples, it incurs additional cost, and is not the same as direct RN sampling. Therefore, it is easier to observe that RW can be better than MHRW sampling. Rasti (2009) observed that random walk sampling can outperform MHRW in the context of peer-to-peer networks. Gjoka et al. (2009) showed that RW (called re-weighted random walk in their paper) and MHRW are comparable. This paper claims that it is RE sampling that is better than RN sampling for large and scale-free networks. RW sampling, on the other hand, is always inferior to RE sampling, and can be much worse than RE and RN sampling when the graph conductance is very small.

Our earlier work on the comparison between RW and RN sampling on the Twitter data (Lu & Li, 2012) motivated the studies conducted in this paper. Lu and Li (2012) found that, on the Twitter data, RW sampling is much better than RN sampling. Wang et al. (in press) experimented with RW sampling on bipartite graphs representing the term-document relationship in search engines. On such bipartite graphs, RW sampling does not have an obvious advantage over RN sampling. Our further study on dozens of other datasets also generated mixed results. Therefore, this paper tries to answer the question of when and why RW is better than RN sampling. We identified two orthogonal factors influencing the sampling method: the degree variation and the conductance.
High degree variation will guarantee that RE sampling works well, and the lack of loosely connected components ensures that RW sampling can approximate RE sampling.

Fig. 2. An illustrative example that favours random edge (RE) sampling.

The harmonic mean estimator was first derived and studied in depth by Salganik and Heckathorn (2004) to estimate the properties of hidden populations such as drug addicts. The degree sampling of networks, which is the focus of this paper, has also received special attention. Stumpf and Wiuf (2005) studied the sampling of degree distributions for two sampling schemes, i.e., random sampling and degree-dependent sampling of the nodes. For average degree estimation, both Feige (2004) and Goldreich and Ron (2004) used uniform random sampling of the nodes. Feige (2004) discussed the lower bound of the estimation. Based on this result, Goldreich and Ron (2004) proposed a sampling scheme that puts more weight on the nodes that have a lower probability of being sampled.

The impact of sampling methods on the discovery of graph properties has also been studied in Leskovec and Faloutsos (2006), Stumpf, Wiuf, and May (2005), Stumpf and Wiuf (2005), and Lee, Kim, and Jeong (2006). They cover a wide range of network properties, and focus on the properties of the derived sub-graph instead of the estimation of the properties of the original graph. For instance, Leskovec and Faloutsos (2006) investigated several network characteristics such as the distribution of connected components. Lee et al. (2006) showed that random node sampling performs better than random edge sampling in approximating the clustering coefficient of the graph.

2.3. Why average degree estimation

Graph properties need to be estimated when the graph in its entirety is not available. This happens when the graph is distributed without a central data repository, such as the Web, or when it is hidden behind searchable web interfaces, such as search engines, online social networks, and millions of textual corpora hidden behind HTML search boxes. In either case the direct calculation of the property is impossible. Next, we highlight the importance of average degree estimation and the practical implications of RE sampling.
Average degree is an important metric for any graph (Kolaczyk, 2009), and has many incarnations in the real world: when the graph is the Web, the average degree is the average number of in/out-links, which is an important property to characterize the Web (Broder et al., 2000); when the graph is an online social network, the average degree is the average number of friends, messages, and followers (Gjoka et al., 2009); when the graph is a term-document network implemented by a search engine, the average degree is the average document size and the average number of query matches (Callan & Connell, 2001; Bar-Yossef & Gurevich, 2008); when the graph is an email network, the average degree is the average number of email contacts. The applications can go beyond computer science to other disciplines, such as finding the average degree (number of friends) of drug addicts (Salganik & Heckathorn, 2004).

Furthermore, average degree estimation can be generalized as the problem of estimating the expectation of a random variable. RN and RE sampling correspond to uniform sampling and PPS sampling, respectively. Thus our results can be extended to other scenarios without a graph representation. In this paper, we discuss the problem in the setting of a graph so that RW sampling can be compared within the same framework. Besides, the graph representation gives us a tangible illustration and a straightforward implementation of the sampling process.

More importantly, the average degree can be used to derive other properties such as the variance and the data size. The variance $\langle d^2 \rangle - \langle d \rangle^2$, or equivalently $\gamma^2 = \langle d^2 \rangle/\langle d \rangle^2 - 1$, depends on the average degree $\langle d \rangle$. More interestingly, it can be estimated using $\gamma^2 = \langle d_E \rangle/\langle d \rangle - 1$, where $\langle d_E \rangle$ is the average degree of the samples obtained by RE sampling. $\gamma^2$ in turn can be used to estimate the number of nodes by $\hat{N} = (\gamma^2 + 1)\frac{n^2}{2C}$ (Lu & Li, 2013; Lu & Li, 2012), where $n$ is the sample size and $C$ is the number of collisions in the samples. $\gamma^2$ can also be used to measure the ratio between the number of friends of your friends and the number of your friends. As the saying goes, your friends have more friends than you do on average. To be more precise, your friends have $\gamma^2 + 1$ times as many friends as you do. Along the same lines, $\gamma^2$ can be used to quantify the diffusion of messages, an idea borrowed from epidemiology. In particular, it can be derived that the threshold for the occurrence of a large component, or the occurrence of epidemics (Jackson, 2008, Eq. 7.8), is $p = \frac{(\gamma^2 + 1)\langle d \rangle - 2}{(\gamma^2 + 1)\langle d \rangle - 1}$, where $p$ is the proportion of the nodes that are immunized uniformly at random from the network.

3. Variance reduction using RE sampling

The performance of estimators can be evaluated in terms of bias and variance. The RN method is unbiased, and the RE method has a small bias that can be ignored compared with the variance when the sample size is large (Salganik & Heckathorn, 2004). Therefore, we focus on the comparison of the variances of the two methods. The variance of the RN method depends on the variance of the degrees, which varies from data to data, and typically grows with the size of the data for a given degree distribution. On the other hand, we find that the variance of the RE method has an upper bound that does not depend on the degree variance. This upper bound guarantees the reduction of the variance by the RE method for very large data.

We develop the upper bound in three steps. First, we use $V_1$, a value obtained from a Taylor expansion, to approximate the variance of the RE estimator. We demonstrate that this approximation is accurate using simulated data and real networks of various topologies. $V_1$ contains the variance of the reciprocal of the sampled degrees, which is difficult to quantify and compare. Therefore, $V_2$ is derived as an upper bound of $V_1$, by exploiting the fact that all the degrees are at least one, and that most of the nodes have very small degrees in scale-free networks. In a typical scale-free network with exponent one, we show that $V_2 \approx 2V_1$. In real networks, we observe that the ratio between $V_2$ and $V_1$ varies widely, between 1.19 and 7.55. Finally, $V_2$ can be simplified further into $V_3$ when the average degree is large.


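3.1. The large variance problem of RN sampling

The variance of the arithmetic mean estimator $\widehat{\langle d \rangle}_{RN}$ for RN sampling is:

$$\mathrm{var}(\widehat{\langle d \rangle}_{RN}) = \frac{\sigma^2}{n} = \frac{\gamma^2 \langle d \rangle^2}{n}, \tag{5}$$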

where $n$ is the sample size. Sometimes it is easier to interpret the variance relative to the true value using the relative standard error (RSE), defined as:

$$\mathrm{RSE}(\widehat{\langle d \rangle}_{RN}) = \frac{\sqrt{\mathrm{var}(\widehat{\langle d \rangle}_{RN})}}{\langle d \rangle} = \frac{\gamma}{\sqrt{n}}. \tag{6}$$
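Inverting the RSE formula gives the sample size needed for a target RSE, $n = (\gamma/\mathrm{RSE})^2$. The sketch below applies it to the Twitter 2009 coefficient of variation quoted in the Introduction; the function name `required_sample_size` is an illustrative choice, not from the paper.

```python
def required_sample_size(gamma: float, target_rse: float) -> float:
    """Invert RSE = gamma / sqrt(n) to obtain n = (gamma / RSE)^2."""
    if gamma < 0 or target_rse <= 0:
        raise ValueError("gamma must be non-negative and target_rse positive")
    return (gamma / target_rse) ** 2


# Coefficient of variation of the Twitter 2009 network, as quoted in the paper.
n_twitter = required_sample_size(gamma=35.95, target_rse=0.1)
print(round(n_twitter))
```

Because $n$ scales with $\gamma^2$, doubling the coefficient of variation quadruples the number of uniform samples required for the same RSE.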

Although $\widehat{\langle d \rangle}_{RN}$ is an unbiased estimator, its variance can be very large for some scale-free networks. The degrees of most real-life networks are close to Zipf's distribution (Newman, 2005), inducing a large variation of the degrees. However, it is hard to quantify the variance exactly because real data do not fit Zipf's law exactly, and the exponent and cut-off value vary from data to data. Nonetheless, we can assume a distribution to gain some understanding of the variance.

When $d_i$ follows the power law, $d_i = A/i^{\alpha}$, where $A$ is a normalizing constant that satisfies

$$\sum_{i=1}^{N} d_i = A \sum_{i=1}^{N} 1/i^{\alpha} \approx A\zeta(\alpha) = N\langle d \rangle,$$

where $\zeta(\cdot)$ is the Riemann zeta function, $\zeta(\alpha) \approx \sum_{i=1}^{N} 1/i^{\alpha}$, based on the assumption that $N$ is a very large number. That is, $A = N\langle d \rangle/\zeta(\alpha)$. Note that this exponent $\alpha$ is for the degree-rank plot. The corresponding frequency-degree plot has slope $(\alpha + 1)$ (Newman, 2005). Since the vast majority of networks have a degree-frequency slope around 2 (Newman, 2010), in the following we derive the variance when the slope is exactly 2, i.e., $\alpha = 1$ in the degree-rank equation. Note that $\zeta(1) \approx \ln N$ (for the sum truncated at $N$), and $\zeta(2) \approx 1.6$. Therefore,

$$\sum_{i=1}^{N} d_i = \sum_{i=1}^{N} \frac{A}{i} = A\zeta(1) \approx A \ln N, \tag{7}$$

$$\sum_{i=1}^{N} d_i^2 = \sum_{i=1}^{N} \frac{A^2}{i^2} = A^2\zeta(2) \approx 1.6A^2. \tag{8}$$

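By the definition of variance, we can derive the variance of the degrees as below:

$$\mathrm{var}(d) = \langle d^2 \rangle - \langle d \rangle^2 = \frac{1}{N}\sum_{i=1}^{N} d_i^2 - \frac{1}{N^2}\left(\sum_{i=1}^{N} d_i\right)^2 \approx \frac{1.6A^2}{N} - \frac{A^2 \ln^2 N}{N^2} = \langle d \rangle^2\left(\frac{1.6N}{\ln^2 N} - 1\right). \tag{9}$$

Therefore, when the exponent $\alpha$ of Zipf's law is 1, the squared coefficient of variation is:

$$\gamma^2 = \frac{1.6N}{\ln^2 N} - 1. \tag{10}$$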

The intuition of this equation is that the variance grows almost linearly with the data size $N$, on the order of $O(N/\ln^2 N)$. The sample size $n$ needs to be on the order of $O(N/\ln^2 N)$ so that satisfactory estimates can be obtained. When the data is very large, almost all the nodes need to be checked before an estimate can be made. That is equivalent to saying that the estimation is infeasible for very large scale-free graphs using uniform random sampling.

For instance, Twitter had $N \approx 5 \times 10^8$ users in 2012. If the degree distribution followed the power law as we assumed, its coefficient of variation would satisfy $\gamma^2 = 1.6N/\ln^2 N \approx 2.0 \times 10^6$. According to Eq. (6), this means that we would need a sample size $n = \gamma^2/0.1^2 = 2.0 \times 10^8$ so that the RSE could be 0.1. To achieve the 95% confidence interval $5 \times 10^8 \pm 10^8$, the sample size is already of the same order as the total population. For the downloaded Twitter data in 2009 that contains $4.1 \times 10^7$ users, we find that the sample size needs to be 129,240 so that the RSE is 0.1.

To verify our derivations, we generate 10 synthetic datasets with the same distribution ($\alpha = 1$) but different data sizes $N$ ranging between $10^5$ and $10^6$. Panel (C) in Fig. 3 shows the observed $\gamma^2$ values along with the ones projected by Eq. (10). Clearly, $\gamma^2$ grows almost linearly with the data size, and the projection is rather accurate. Panels (A) and (B) plot the degree distributions when $N = 10^5$. Panel (A) is the degree-rank plot with slope 1, i.e., $\alpha = 1$. Panel (B) is the frequency-degree plot for the same data. We can see that the slope of the frequency-degree plot is $(\alpha + 1) = 2$. Panel (D) compares the estimator variance for the RE and RN methods when the sample size is 100. It shows that, for the same data distribution, the performance of the RE method remains almost constant. Compared with the increasing variance of the RN method, RE sampling becomes better when $N > 50{,}000$, and the advantage becomes larger as the data size increases.
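The growth of $\gamma^2$ projected by Eq. (10) can be checked numerically. The sketch below is not the authors' experiment: it assumes idealized real-valued degrees $d_i = A/i$ (no rounding or cut-off), for which $A$ cancels and $\gamma^2 = N\sum_i i^{-2} / (\sum_i i^{-1})^2 - 1$. The function names are illustrative.

```python
import math


def gamma_squared_zipf(n_nodes: int) -> float:
    """Exact gamma^2 for idealized degrees d_i = A / i (alpha = 1).

    The constant A cancels: gamma^2 = N * sum(1/i^2) / (sum(1/i))^2 - 1.
    """
    s1 = sum(1.0 / i for i in range(1, n_nodes + 1))
    s2 = sum(1.0 / (i * i) for i in range(1, n_nodes + 1))
    return n_nodes * s2 / (s1 * s1) - 1.0


def gamma_squared_eq10(n_nodes: int) -> float:
    """Closed-form projection of Eq. (10): 1.6 N / ln^2 N - 1."""
    return 1.6 * n_nodes / math.log(n_nodes) ** 2 - 1.0


for n_nodes in (10**4, 10**5):
    exact = gamma_squared_zipf(n_nodes)
    projected = gamma_squared_eq10(n_nodes)
    print(f"N={n_nodes}: exact={exact:.1f}, Eq.(10)={projected:.1f}")

# Projection for the Twitter-scale population size quoted in the text.
twitter_projection = gamma_squared_eq10(5 * 10**8)
```

Under these assumptions the closed form slightly overshoots the exact sum, because $\ln N$ underestimates the harmonic sum, but both grow nearly linearly in $N$.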

[Figure 3 panels: (A) Degree vs. Rank; (B) Frequency vs. Degree; (C) $\gamma^2$ vs. data size $N$ (true value and estimated); (D) variance of RN and RE vs. data size $N$.]

Fig. 3. $\gamma^2$ grows almost linearly with data size $N$ when the degree distributions are the same. (A) Degree-rank log–log plot when $N = 10^5$. (B) The frequency-degree plot of the same data. (C) Observed $\gamma^2$ against the data size $N$, along with the projected $\gamma^2$ in Eq. (10). (D) Variances of RN and RE sampling when the sample size is $n = 100$.

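3.2. Variance of RE sampling

Given a set of degrees $\{d_{x_1}, d_{x_2}, \ldots, d_{x_n}\}$ obtained by RE sampling, recall that the harmonic mean estimator is:

$$\widehat{\langle d \rangle}_{RE} = n\left[\sum_{i=1}^{n} \frac{1}{d_{x_i}}\right]^{-1}. \tag{11}$$

Let the random variables $v = 1/d_{x_i}$ and $V = \sum_{i=1}^{n} 1/d_{x_i}$. Since node $i$ is sampled with probability $p_i = d_i/s$, we have $E(v) = E(1/d_{x_i}) = 1/\langle d \rangle$ and $E(V) = n/\langle d \rangle$. Our first approximation to the variance is obtained by applying the Taylor expansion of $\widehat{\langle d \rangle}_{RE}$ around $E(V)$. The result is:

$$\widehat{\langle d \rangle}_{RE} = \frac{n}{V} = n\left(\frac{1}{E(V)} - \frac{V - E(V)}{E(V)^2} + \ldots\right). \tag{12}$$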

By applying the variance to the first two terms of the Taylor expansion, we obtain an approximate variance of $\widehat{\langle d \rangle}_{RE}$ (denoted by $V_1$) as follows:

$$V_1 = \frac{n^2\,\mathrm{var}(V)}{E(V)^4} = \frac{\langle d \rangle^4\,\mathrm{var}(v)}{n}. \tag{13}$$
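The second equality in the expression for $V_1$ can be spelled out as follows. This is a sketch of the intermediate step under the assumption, not stated explicitly in the text, that the $n$ sampled reciprocals $1/d_{x_i}$ are independent and identically distributed, so that $\mathrm{var}(V) = n\,\mathrm{var}(v)$:

```latex
\begin{aligned}
V_1 &= \frac{n^2\,\mathrm{var}(V)}{E(V)^4}
     = \frac{n^2 \cdot n\,\mathrm{var}(v)}{\left(n/\langle d \rangle\right)^4} \\
    &= \frac{n^3\,\langle d \rangle^4\,\mathrm{var}(v)}{n^4}
     = \frac{\langle d \rangle^4\,\mathrm{var}(v)}{n}.
\end{aligned}
```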

Next, we need to find a bound for $\mathrm{var}(v)$. By the definition of variance,

$$\mathrm{var}(v) = E(v^2) - (E(v))^2 = E\left(\frac{1}{d_{x_i}^2}\right) - \frac{1}{\langle d \rangle^2}. \tag{14}$$

Since $d_{x_i}$ is obtained with probability $p_{x_i} = d_{x_i}/s$, where $s = \sum_{i=1}^{N} d_i$,

$$E\left(\frac{1}{d_{x_i}^2}\right) = \sum_{i=1}^{N} p_i \frac{1}{d_i^2} = \frac{1}{s}\sum_{i=1}^{N} \frac{1}{d_i} = \frac{1}{N\langle d \rangle}\sum_{i=1}^{N} \frac{1}{d_i}. \tag{15}$$

$\sum_{i=1}^{N} 1/d_i$ varies from data to data, but a safe upper bound is $N \ge \sum_{i=1}^{N} 1/d_i$ since every $d_i \ge 1$. In scale-free networks, $\sum_{i=1}^{N} 1/d_i$ is not far from $N$, as we will show in the simulation study and in real networks. Therefore we derive an upper bound for $\mathrm{var}(\widehat{\langle d \rangle}_{RE})$, which is called $V_2$ as defined below:


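$$V_2 = \frac{\langle d \rangle^4}{n}\left(\frac{1}{\langle d \rangle} - \frac{1}{\langle d \rangle^2}\right). \tag{16}$$

When $\langle d \rangle$ is a large number, the second term $1/\langle d \rangle^2$ can be neglected, resulting in a simplified upper bound $V_3$,

$$V_3 = \frac{\langle d \rangle^4}{n}\,\frac{1}{\langle d \rangle} = \frac{\langle d \rangle^3}{n}. \tag{17}$$

In summary, $\mathrm{var}(\widehat{\langle d \rangle}_{RE}) \approx V_1 < V_2 < V_3$. Thus, we derive the following theorem:

Theorem 1. The upper bound for the variance of RE sampling is $\langle d \rangle^3/n$. Or,

$$\mathrm{var}(\widehat{\langle d \rangle}_{RE}) < \frac{\langle d \rangle^3}{n}. \tag{18}$$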
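The chain $V_1 \le V_2 < V_3$ can be checked on synthetic data. The sketch below is illustrative only and rests on several assumptions that are not from the paper: degrees are generated as integers $\max(1, \mathrm{round}(A/i))$, RE sampling is modelled as independent degree-proportional (PPS) node draws with replacement, and all function names and parameter values are hypothetical.

```python
import random
from itertools import accumulate
from statistics import variance


def zipf_degrees(n_nodes, mean_degree):
    """Hypothetical generator: integer degrees close to A / i, clipped to >= 1."""
    a = n_nodes * mean_degree / sum(1.0 / i for i in range(1, n_nodes + 1))
    return [max(1, round(a / i)) for i in range(1, n_nodes + 1)]


def variance_approximations(degrees, n):
    """Return (V1, V2, V3) for sample size n, following the text's definitions."""
    big_n = len(degrees)
    mean_d = sum(degrees) / big_n
    # E(1/d^2) under degree-proportional sampling: sum(1/d_i) / (N <d>).
    e_inv_sq = sum(1.0 / d for d in degrees) / (big_n * mean_d)
    v1 = mean_d**4 * (e_inv_sq - 1.0 / mean_d**2) / n
    v2 = mean_d**4 * (1.0 / mean_d - 1.0 / mean_d**2) / n
    v3 = mean_d**3 / n
    return v1, v2, v3


def observed_re_variance(degrees, n, repetitions, seed=0):
    """Empirical variance of the harmonic mean estimator over repeated PPS draws."""
    rng = random.Random(seed)
    cum = list(accumulate(degrees))  # cumulative weights = degree-proportional
    estimates = []
    for _ in range(repetitions):
        sample = rng.choices(degrees, cum_weights=cum, k=n)
        estimates.append(n / sum(1.0 / d for d in sample))
    return variance(estimates)


degrees = zipf_degrees(n_nodes=100_000, mean_degree=10)
v1, v2, v3 = variance_approximations(degrees, n=100)
obs = observed_re_variance(degrees, n=100, repetitions=200)
print(f"V1={v1:.2f}  V2={v2:.2f}  V3={v3:.2f}  observed={obs:.2f}")
```

The ordering of $V_1$, $V_2$, and $V_3$ is deterministic for any degree sequence with $d_i \ge 1$; the observed variance depends on the random seed and on the number of repetitions, so it should be read as a rough comparison rather than a verification.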

We highlight two points regarding this result. First, the upper bound does not depend on the degree distribution or other topological characteristics such as graph conductance. As long as the nodes are selected with probability proportional to their degrees, the result holds no matter whether the graph is scale-free or has tightly knit communities. On the other hand, the performance of RW sampling depends on graph conductance, as we will explain in Section 4.3.

Second, the upper bound is surprisingly simple, involving only the average degree and the sample size. Unlike the variance of RN sampling, which is associated with the degree variance, the variance of RE sampling is bounded from above by a constant determined by the average degree. Recall that the variance of the RN estimator is $\mathrm{var}(\widehat{\langle d \rangle}_{RN}) = \langle d \rangle^2 \gamma^2 / n$. Comparing the variances of the estimators $\widehat{\langle d \rangle}_{RN}$ and $\widehat{\langle d \rangle}_{RE}$, we have:

Corollary 1. RE sampling reduces the variance at least by a factor of $\gamma^2/\langle d \rangle$, i.e.,

$$\frac{\mathrm{var}(\widehat{\langle d \rangle}_{RN})}{\mathrm{var}(\widehat{\langle d \rangle}_{RE})} > \frac{\gamma^2}{\langle d \rangle}. \tag{19}$$
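As a quick numerical illustration of Corollary 1, the sketch below plugs in the Twitter 2009 figures quoted in the Introduction ($\gamma = 35.95$, $\langle d \rangle = 70.51$). The function names and the sample size `n = 1000` are illustrative assumptions; the sample size cancels in the ratio.

```python
def rn_variance(gamma, mean_degree, n):
    """Variance of the RN estimator: gamma^2 <d>^2 / n."""
    return gamma**2 * mean_degree**2 / n


def re_variance_bound(mean_degree, n):
    """Upper bound of Theorem 1 on the RE estimator variance: <d>^3 / n."""
    return mean_degree**3 / n


def min_improvement_ratio(gamma, mean_degree):
    """Lower bound of Corollary 1 on var(RN) / var(RE): gamma^2 / <d>."""
    return gamma**2 / mean_degree


gamma, mean_degree, n = 35.95, 70.51, 1000  # n is arbitrary; it cancels out
ratio = rn_variance(gamma, mean_degree, n) / re_variance_bound(mean_degree, n)
print(f"guaranteed variance reduction factor: at least {ratio:.2f}")
```

This is only the guaranteed lower bound; the reduction actually observed can be larger because the true RE variance may lie well below $\langle d \rangle^3/n$.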

3.3. Simulation studies

To understand the relationship between the upper bound and the true variance, we demonstrate the variances using synthetic datasets whose degree distributions follow a power law $d_i = A/(b + i)$, where $A$ is a normalizing constant $N\langle d \rangle / \sum_{i=1}^{N} 1/(b + i)$. This is called the Zipf–Mandelbrot law (Montemurro, 2001), which can model real data better than $d_i = A/i^{\alpha}$. When $b = 0$, it reduces to the Zipf's law described in Section 3.1.

We experiment with two distributions, and report the results in Fig. 4. The first row is for the distribution with $b = 0$, and the second row has $b = 100$. For both distributions, we generate degrees satisfying the distribution with data size $N = 10^6$, and conduct RE sampling 100 times. From the 100 repetitions, we record the variance of the RE estimator, along with its approximations $V_1$, $V_2$, and $V_3$. Panels (A) and (D) are the degree-rank plots for the data, giving a visual understanding of the distribution. Panels (B) and (E) are the plots of the variances ($V_1$, $V_2$, $V_3$, and the observed variance) over sample sizes ranging between 100 and 1000. Panels (C) and (F) are the corresponding log–log plots.

[Figure 4 panels: (A), (D) Degree vs. Rank; (B), (E) Variances ($V_3$, $V_2$, $V_1$, observed) vs. sample size $n$; (C), (F) the same variances on log–log axes.]

Fig. 4. The upper bound $V_3$ and the observed variances, along with the approximations $V_1$ and $V_2$. Sample sizes range between 100 and 1000. $N = 10^6$; $\alpha = 1$; $\langle d \rangle = 10$. The observed variance is obtained from 100 repetitions.

We make the following observations from the simulation study:

• $V_3$ vs. $V_2$: According to our analysis in the previous subsection, the ratio between $V_3$ and $V_2$ should be that of $1/\langle d \rangle$ and $1/\langle d \rangle - 1/\langle d \rangle^2$, which is $\langle d \rangle/(\langle d \rangle - 1)$. In this particular data, $\langle d \rangle = 10$, and we can see that the ratio between $V_3$ and $V_2$ is very close to 10/9 as expected.

• $V_2$ vs. $V_1$: The difference between $V_1$ and $V_2$ is determined by the difference between $\sum_{i=1}^{N} 1/d_i$ and $N$. When $b = 0$,

$$\sum_{i=1}^{N} 1/d_i = \frac{1}{A}\sum_{i=1}^{N} i = \frac{\ln N}{N\langle d \rangle}\,\frac{N(N - 1)}{2} \approx \frac{N}{2}. \tag{20}$$

That is, $\sum_{i=1}^{N} 1/d_i = 0.50N$. Therefore $V_2/V_1$ is approximately two, as shown in panels (B) and (C). When $\beta = 100$, $\sum_{i=1}^{N} 1/d_i = 0.29N$, which is smaller than in the previous case. Therefore, the gap between $V_2$ and $V_1$ is larger.

• $V_1$ vs. observed variance: $V_1$ is obtained from the Taylor expansion of $1/V$ by ignoring the third term in Eq. (12). While the quality of the approximation varies from data to data, the gap is negligible in these two simulated data sets, and in the 18 real networks, as we will show in Section 4.

4. Experiments on real networks

4.1. Datasets

We conducted experiments on 18 real networks, most of which are from the Stanford SNAP graph collection (Leskovec & Faloutsos, 2006). Due to space limitations, for some network categories only one graph is reported when the graphs in the category behave similarly. For instance, citation graphs have similar degree distributions, similar coefficients of variation, and similar error ratios between RN, RE, and RW samplings. For these categories, we choose only one graph per category. In the category of Web graph datasets, RW sampling deviates greatly from RE sampling. To investigate the cause of this deviation, we examined several Web graphs on the domains of Notre Dame, Stanford, and Berkeley-Stanford. The Facebook data is one of the few exceptions where RE sampling is inferior to RN sampling. Therefore, we include two Facebook graphs. A complete data description and the programs can be found at http://cs.uwindsor.ca/jlu/degreevar. The statistics of the datasets are summarized in Table 2, sorted according to $\gamma$, the coefficient of variation of the degrees.

We make several observations on the datasets. First, most of them are scale-free networks, as shown in Fig. 5. The frequency-degree slope is around 2, so the corresponding degree-rank slope should be around 1, the same slope we selected in our simulation studies. Some datasets, such as the Facebook and Citation networks, have a curve that is captured by the Zipf–Mandelbrot law we used.
One exception is the RoadNet network that is closer to normal or log-normal distribution.

Table 2
Statistics of the 18 graphs, sorted in decreasing order of the coefficient of degree variation $\gamma$. Each graph has a citation indicating where the data is from.

Graph                                        $\gamma$   $\langle d \rangle$   # Nodes
Twitter Kwak et al. (2010)                   35.95      70.51                 41,652,230
WikiTalk Leskovec and Faloutsos (2006)       26.32       3.90                  2,388,953
BerkStan Leskovec and Faloutsos (2006)       14.51      20.10                    654,782
EmailEu Leskovec and Faloutsos (2006)        13.66       3.02                    224,832
Stanford Leskovec and Faloutsos (2006)       11.51      15.21                    255,265
Skitter Leskovec and Faloutsos (2006)        10.46      13.09                  1,694,616
Youtube Mislove et al. (2007)                 9.64       5.27                  1,134,890
NotreDame Leskovec and Faloutsos (2006)       6.40       6.69                    325,729
Gowalla Leskovec and Faloutsos (2006)         5.54       9.67                    196,591
Epinion Leskovec and Faloutsos (2006)         4.02      10.69                     75,877
Google Leskovec and Faloutsos (2006)          4.00      10.03                    855,802
Slashdot Leskovec and Faloutsos (2006)        3.35      12.27                     82,168
Facebook-1 Wilson et al. (2009)               3.14      14.27                  2,937,612
Flickr Leskovec and Faloutsos (2006)          2.64      43.52                    105,720
Facebook-2 Viswanath et al. (2009)            1.55      25.77                     63,392
Amazon Leskovec and Faloutsos (2006)          1.27      11.89                    410,236
CitePatents Leskovec and Faloutsos (2006)     1.20       8.77                  3,764,117
RoadNet Leskovec and Faloutsos (2006)         0.35       2.82                  1,965,206
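The $\gamma$ and $\langle d \rangle$ columns of Table 2 are enough to evaluate the quantity $\gamma^2/\langle d \rangle$ that the paper uses as a lower bound on the RN/RE variance ratio. The reasoning is short: the RN estimator is a sample mean, so its variance is $\sigma^2/n = \gamma^2\langle d \rangle^2/n$, while the RE variance is bounded by $\langle d \rangle^3/n$, and dividing the two gives $\gamma^2/\langle d \rangle$. The sketch below only carries out this arithmetic on the table values; the function and variable names are illustrative and not taken from the authors' programs.

```python
# Lower bound gamma^2 / <d> on Var(RN) / Var(RE), evaluated on Table 2.
# Only the arithmetic is illustrated; no sampling is performed here.

# (graph, gamma, average degree), copied from Table 2.
TABLE_2 = [
    ("Twitter", 35.95, 70.51), ("WikiTalk", 26.32, 3.90),
    ("BerkStan", 14.51, 20.10), ("EmailEu", 13.66, 3.02),
    ("Stanford", 11.51, 15.21), ("Skitter", 10.46, 13.09),
    ("Youtube", 9.64, 5.27), ("NotreDame", 6.40, 6.69),
    ("Gowalla", 5.54, 9.67), ("Epinion", 4.02, 10.69),
    ("Google", 4.00, 10.03), ("Slashdot", 3.35, 12.27),
    ("Facebook-1", 3.14, 14.27), ("Flickr", 2.64, 43.52),
    ("Facebook-2", 1.55, 25.77), ("Amazon", 1.27, 11.89),
    ("CitePatents", 1.20, 8.77), ("RoadNet", 0.35, 2.82),
]


def variance_ratio_bound(gamma: float, avg_degree: float) -> float:
    """Return gamma^2 / <d>, the analytical lower bound on Var(RN) / Var(RE)."""
    return gamma ** 2 / avg_degree


bounds = {name: variance_ratio_bound(g, d) for name, g, d in TABLE_2}

if __name__ == "__main__":
    # Print the graphs from the largest bound to the smallest.
    for name, value in sorted(bounds.items(), key=lambda kv: -kv[1]):
        print(f"{name:12s} {value:8.2f}")
```

A bound below one does not by itself mean that RE is worse than RN, since it is only a lower bound on the ratio; it simply gives no guarantee of an improvement for that graph.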

[Figure 5 (plots omitted): 18 log–log panels of Frequency (y-axis) versus degree (x-axis): (1) Twitter, (2) WikiTalk, (3) BerkStan, (4) EmailEu, (5) Stanford, (6) Skitter, (7) Youtube, (8) NotreDame, (9) Gowalla, (10) Epinion, (11) Google, (12) Slashdot, (13) Facebook-1, (14) Flickr, (15) Facebook-2, (16) Amazon, (17) Citation, (18) RoadNet.]

Fig. 5. Degree distributions of 18 graphs. Plots are sorted in decreasing order of coefficient of variation $\gamma$.
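The correspondence between a frequency-degree slope of about 2 and a degree-rank slope of about 1 can be sketched with a short derivation. It is a heuristic argument that treats degree as a continuous variable; the symbols $\tau$ (frequency exponent) and $r(k)$ (rank of a node of degree $k$) are introduced here for illustration only and are not the paper's notation.

```latex
\begin{align*}
f(k) &\propto k^{-\tau}
  && \text{(frequency-degree law)} \\
r(k) &= \#\{\, i : d_i \ge k \,\}
       \;\propto\; \int_k^{\infty} x^{-\tau}\,dx
       \;\propto\; k^{-(\tau-1)}
  && (\tau > 1) \\
d_r &\propto r^{-1/(\tau-1)}
  && \text{(degree-rank law)}
\end{align*}
```

With $\tau = 2$ this gives $d_r \propto r^{-1}$, that is, a degree-rank slope of 1 on a log–log plot, which matches the Zipf's law setting used in the simulation studies.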

There are irregular data distributions, such as Flickr and Amazon, that have broken lines. Also, Web graphs (sub-Figs. 3, 5, 8) do not form a straight line in the upper part of the log–log plots, indicating irregularity in the graph structure. Despite the variety of the datasets, we will show that our result holds without exception.

Second, it is interesting to note that two representative social networks, Twitter and Facebook, are at the two extremes of the spectrum of $\gamma$ values, due to the way the networks are formed. The Twitter dataset is much larger, and Twitter allows an unlimited number of followers, while Facebook has an upper limit on the number of friends. Therefore Twitter is a scale-free network with large degree variation, while Facebook has a sharply dropping curve, causing a low $\gamma$ value. Besides, the Facebook datasets are much smaller. Because of this structural difference, RE is a hundred times better than RN sampling for the Twitter data, while RE and RN samplings are similar for the Facebook data.

One technical detail that needs to be mentioned is the sampling of the Twitter data. It is the complete user network collected in 2009 (Kwak et al., 2010), which has billions of edges that cannot fit into computer memory. We use the indexing engine Lucene to store the data on the hard drive and use the search engine to mimic the random sampling methods. We treat all the neighbours of a node as a 'document', and build an index for those 'documents', or neighbours. Graph sampling can then be accomplished by searching the index.

4.2. RE vs. RN sampling

Despite their variety in degree distribution and topology, all 18 datasets support our analytical result. We demonstrate this first with a fixed sample size in Fig. 6, then with the trend over sample sizes in Fig. 7.

Fig. 6 panel (A) demonstrates our main result, i.e., the variance ratios between RN and RE samplings have an almost linear relation with $\gamma^2/\langle d \rangle$, with a Pearson's correlation coefficient as high as 0.9867.
In addition, the ratios are consistently higher than $\gamma^2/\langle d \rangle$, indicating that $\langle d \rangle^3$ is indeed an upper bound of the RE variance. The plot is in log–log scale so that the points are spread out along the axes. For this experiment, the sample size is n = 100, and the variances are obtained with 20,000 repetitions.

There are only five datasets whose RN/RE ratio is slightly below one, i.e., for which RE sampling is slightly worse than RN sampling. A closer inspection of these datasets shows that they all have small degree variations, as shown in Table 2 and Fig. 5. Both the Citation and RoadNet networks are at the lower end of the $\gamma$ values. The RoadNet network has a maximum degree of 12, and its degrees follow a log-normal distribution. The Facebook network has an upper limit on the number of friends, thus the maximal degree is abnormally small compared with its size. The Flickr network has an irregular degree distribution with a large bump around degree 100.

Panels (B) and (C) in Fig. 6 are plotted to corroborate our conclusion. Panel (B) shows that $V_3$ is indeed an upper bound for the RE variance. For all 18 datasets, $V_3$ is consistently larger than the observed variance. For some datasets, such as Epinion and Slashdot, the upper bound is rather close to the true variance. Panel (C) verifies that $V_1$, the approximation obtained by the Taylor expansion, is very close to the real variance.

Fig. 7 shows the trend of the variances over various sample sizes, and reconfirms the relationship between the variances $V_1$, $V_2$, $V_3$, and the real one. It demonstrates that $V_3 > V_2 > V_1 \approx$ observed variance, as expected. Overall, $V_1$ is very close to the observed variance, as the Taylor expansion approximates the original function well. $V_2$ and $V_3$ are also close in general, since their difference is dictated by $\langle d \rangle/(\langle d \rangle - 1)$. When the average degree $\langle d \rangle$ is small, as in RoadNet, WikiTalk and

[Figure 6 (plots omitted): three log–log scatter plots over the 18 datasets, each with a reference line "When x=y". (A) RN/RE ratio: RN/RE of variance versus $\gamma^2/\langle d \rangle$. (B) $V_3$: upper bound $V_3$ versus observed variance. (C) $V_1$: Var $V_1$ versus observed variance.]

Fig. 6. Variance of RE sampling for 18 datasets. (A) RN/RE ratio is always higher than $\gamma^2/\langle d \rangle$; (B) $V_3$, the upper bound, is greater than the observed variance; (C) $V_1$, the approximation obtained by Taylor expansion, is close to the observed variance. Sample size n = 100. Variances are obtained from 20,000 repetitions.

EmailEu, the difference between $V_2$ and $V_3$ is noticeable. The largest gap is between $V_1$ and $V_2$, which is determined by $N/\sum_{i=1}^{N} 1/d_i$. As we have shown in the simulation studies in Section 3.3, $N/\sum_{i=1}^{N} 1/d_i$ is around two when the exponent $\alpha = 1$ and the average degree is 10. That is, $V_2$ is about twice as large as $V_1$. In real datasets, $N/\sum_{i=1}^{N} 1/d_i$ varies from data to data, ranging from 1.19 (WikiTalk) to 7.55 (Flickr).

Another perspective for understanding the reduced variance of RE sampling is the sample distributions depicted in Fig. 8, where the sample size is 8000. Most of the sample distributions have a "V" shape, indicating that the small nodes still follow a power law as in the original data, while the large nodes can be sampled many times. In RE sampling, the sampling probability $P(k)$ of degree $k$ is determined by $k$ and its frequency $f(k)$. Bearing in mind that $f(k) \propto k^{-\alpha}$, where $\alpha$ is normally around two, we have:

$$P(k) \propto k \cdot f(k) \propto k \cdot k^{-\alpha} = k^{-(\alpha - 1)}. \qquad (21)$$
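The relation $P(k) \propto k \cdot f(k)$ can be checked on a toy degree list. The sketch below is an illustration only: the degree list is hypothetical (many small nodes plus a single hub) and the function name is not from the paper. It computes the exact probability that one random-edge draw lands on a node of degree $k$.

```python
from collections import Counter
from fractions import Fraction


def re_degree_probabilities(degrees):
    """Return {k: P(k)} with P(k) = k * f(k) / sum(degrees).

    Under RE sampling each node is hit in proportion to its degree, so the
    chance of observing degree k is k times its frequency f(k), normalised
    by the total degree.
    """
    total = sum(degrees)
    freq = Counter(degrees)  # f(k): number of nodes with degree k
    return {k: Fraction(k * f, total) for k, f in sorted(freq.items())}


# Hypothetical toy data: eight nodes of degree 1, two of degree 2,
# one of degree 3, and a single hub of degree 20.
toy_degrees = [1] * 8 + [2] * 2 + [3] + [20]
probs = re_degree_probabilities(toy_degrees)

if __name__ == "__main__":
    for k, p in probs.items():
        print(f"degree {k:2d}: P(k) = {p}")
```

On this toy list the probabilities fall from degree 1 to degree 3 and then rise sharply at the hub, which is a small-scale version of the "V" shape described for Fig. 8.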

[Figure 7 (plots omitted): 18 panels, one per dataset (Twitter, WikiTalk, BerkStan, EmailEu, Stanford, Skitter, Youtube, NotreDame, Gowalla, Epinions, Google, Slashdot, Facebook1, FlickrEdges, Facebook2, Amazon, Citation, RoadNet), each plotting Variances against Sample size n for the series $V_3$, $V_2$, $V_1$, and Obs. var.]

Fig. 7. Variances vs. sample size for 18 datasets. Sample size n ranges between 100 and 1000. Variances are obtained over 1000 repetitions. The upper bound $V_3$ is higher than the observed variance.
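The comparison behind Figs. 6 and 7 can be reproduced in miniature on synthetic data. The following is a minimal sketch under stated assumptions, not the authors' program: degrees follow a degree-rank Zipf law $d_i = A/i$ rounded to integers of at least one, sampling is with replacement, and the graph size, sample size, and number of runs are small illustrative values. RN uses the arithmetic mean of uniformly sampled degrees; RE samples nodes with probability proportional to degree and uses the harmonic mean.

```python
import random
import statistics
from itertools import accumulate


def zipf_degrees(num_nodes, avg_degree):
    """Degrees d_i = A / i, rounded to integers >= 1 (a simplification)."""
    harmonic = sum(1.0 / i for i in range(1, num_nodes + 1))
    scale = num_nodes * avg_degree / harmonic  # the constant A
    return [max(1, round(scale / i)) for i in range(1, num_nodes + 1)]


def rn_estimate(degrees, n, rng):
    """Uniform random node sample: arithmetic mean of the sampled degrees."""
    return statistics.fmean(rng.choices(degrees, k=n))


def re_estimate(degrees, cum_weights, n, rng):
    """Random edge sample (node drawn with probability proportional to its
    degree): harmonic mean of the sampled degrees."""
    sample = rng.choices(degrees, cum_weights=cum_weights, k=n)
    return n / sum(1.0 / d for d in sample)


def compare(num_nodes=20_000, avg_degree=10, n=200, runs=500, seed=1):
    """Return the true average degree and the observed estimator variances."""
    rng = random.Random(seed)
    degrees = zipf_degrees(num_nodes, avg_degree)
    cum_weights = list(accumulate(degrees))  # degree-proportional weights
    rn_runs = [rn_estimate(degrees, n, rng) for _ in range(runs)]
    re_runs = [re_estimate(degrees, cum_weights, n, rng) for _ in range(runs)]
    return {
        "true": statistics.fmean(degrees),
        "var_rn": statistics.variance(rn_runs),
        "var_re": statistics.variance(re_runs),
    }


if __name__ == "__main__":
    result = compare()
    print(result, "RN/RE ratio:", result["var_rn"] / result["var_re"])
```

Because the RN variance is driven by the rare draws of very large nodes, its observed value fluctuates considerably between seeds, whereas the RE variance stays below $\langle d \rangle^3/n$.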

[Figure 8 (plots omitted): 18 panels, ordered as in Fig. 5: (1) Twitter, (2) WikiTalk, (3) BerkStan, (4) EmailEu, (5) Stanford, (6) Skitter, (7) Youtube, (8) NotreDame, (9) Gowalla, (10) Epinion, (11) Google, (12) Slashdot, (13) Facebook-1, (14) Flickr, (15) Facebook-2, (16) Amazon, (17) Citation, (18) RoadNet.]

Fig. 8. Degree distributions of the samples obtained from RE sampling. n = 8,000.

Therefore, the sampling probability still follows a power law, with the exponent $\alpha - 1$. As with the power law for the degree distribution, the formula is accurate only when $k$ is small. When $k$ is large, $f(k) < 1$ in the formula. In real data, the frequency $f(k)$ cannot be a fractional number. Instead, $f(k)$ must be zero in most cases so that on average the frequencies follow the power law. When $f(k)$ is a non-zero number such as one, $P(k)$ amplifies it by a factor of $k$, thereby generating the ascending branch in the sampling frequency plot.

In other words, both small and large nodes are sampled multiple times, but for different reasons. Small nodes are sampled because there are many of them. Although each individual small node has a very small probability of being sampled, collectively the large number of small nodes guarantees that some will be sampled. On the other hand, large nodes are sampled because they have a higher probability of being hit by random edges, even though there are only a few of them. Therefore both small and large nodes are well represented in the sample, resulting in a small variance of the estimation. In RN sampling, large nodes are included by chance, inducing a large variance in estimation.

The datasets that do not have the "V" shape in RE sampling happen to be the ones that do not favour RE sampling. They do not have nodes that are large enough to be sampled many times. Their RE sample distributions are similar to the original data, or to the RN sample distribution. Therefore RE sampling does not have an advantage on this kind of data.

Two of the representative online social networks are Twitter and Facebook. It is interesting to see that they favour different sampling methods, one RE sampling and the other RN sampling. Moreover, their RN/RE ratios happen to be at the two extremes of the spectrum. Twitter has the third highest RN/RE ratio because it is scale-free and the largest network in our experiment. Facebook-2 has the lowest RN/RE ratio because it has a cap on the number of friends.

4.3. RW sampling

RW sampling can be regarded as an approximation to RE sampling in that asymptotically the node sampling probability is proportional to its degree. The sampling probability is not exactly PPS, yet the PPS estimator is used. Therefore, the bias of the estimator can no longer be omitted in our evaluation. Both bias and variance should be considered, and they can be measured by RRMSE (Relative Root MSE) as defined below:

$$\mathrm{RRMSE}(\widehat{\langle d \rangle}) = \frac{1}{\langle d \rangle}\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\widehat{\langle d \rangle}_i - \langle d \rangle\right)^2} \qquad (22)$$

where $\widehat{\langle d \rangle}$ is an estimator, $\langle d \rangle$ is the true average degree, and $\widehat{\langle d \rangle}_i$ is the estimate obtained in the $i$-th run. In our experiments, all the RRMSE data are obtained from 5000 independent runs, except for the Twitter data, which has 2000 runs due to its large size and the long computation time of the sampling.

Fig. 9 shows the comparison of the three estimators. Our first observation is that RW is consistently worse than RE, as expected, since it is an approximation of RE sampling. To understand exactly how much worse RW is, we plot the RW/RE ratio in Fig. 10 against the graph conductance $\Phi$ (Sinclair & Jerrum, 1988), which reflects the mixing time of the random walk. Very small graph conductance indicates the existence of loosely connected components. For datasets with small conductance, such as NotreDame, Stanford, BerkStan and Flickr, RW is worse than RE sampling by a factor of ten in terms of RRMSE. In terms of variance, it is worse by a factor of a hundred.

To develop the intuition for such poor performance, we plot in Fig. 11 their random walk traces when the estimations have large bias. Each RW trace contains $10^4$ steps. All four datasets, especially Flickr and NotreDame, have an extremely dense component that dangles loosely from the main component. This shows that RW sampling depends not only on the variation of the degrees, but also on the topological structure of the graph. In the Flickr graph, there are two almost disconnected components.

Fig. 9. RE, RW, and RN samplings on 18 graphs in terms of RRMSE. Sample size n = 400, and RRMSEs are obtained over 5000 runs except for Twitter (2000 runs).
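Eq. (22) translates directly into a few lines of code. The sketch below is illustrative: the function name and the toy inputs are not from the paper, and `estimates` stands for the list of estimates collected over the independent runs.

```python
import math


def rrmse(estimates, true_value):
    """Relative root mean squared error of repeated estimates, as in Eq. (22).

    The squared error is averaged over the runs, so the result reflects both
    the bias and the variance of the estimator.
    """
    mse = sum((e - true_value) ** 2 for e in estimates) / len(estimates)
    return math.sqrt(mse) / true_value


if __name__ == "__main__":
    # Unbiased but noisy runs, and biased but noiseless runs (toy numbers).
    print(rrmse([9.0, 11.0], 10.0))
    print(rrmse([12.0, 12.0], 10.0))
```

The second call shows why RRMSE is used for RW sampling: an estimator with no spread at all can still have a large error when it is biased.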

[Figure 10 (plot omitted): ratio of RW and RE (y-axis, 0 to 18) versus $-\log_{10}(\Phi)$ (x-axis, 1.5 to 4.5); labelled points include Facebook-2, Youtube, Amazon, Flickr, NotreDame, Stanford, and BerkStan. Pearson Corr: 0.68652.]

Fig. 10. Standard error ratio between RW and RE vs. graph conductance $\Phi$ for 18 datasets. Sample size is 400.
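To make the notion of conductance concrete, the sketch below evaluates the conductance of a single cut in an undirected graph, using the common definition (edges crossing the cut divided by the smaller of the two volumes), which we assume matches the paper's usage. The graph conductance $\Phi$ is the minimum of this quantity over all cuts, which is expensive to compute exactly, so only one given cut is evaluated here. The toy graph and the function names are hypothetical.

```python
def cut_conductance(adj, subset):
    """Conductance of the cut between `subset` and the rest of the graph.

    adj maps each node to the set of its neighbours (undirected graph).
    The volume of a node set is the sum of the degrees of its nodes.
    """
    inside = set(subset)
    crossing = sum(1 for u in inside for v in adj[u] if v not in inside)
    vol_inside = sum(len(adj[u]) for u in inside)
    vol_outside = sum(len(adj[u]) for u in adj if u not in inside)
    return crossing / min(vol_inside, vol_outside)


def two_cliques_with_bridge(size):
    """Hypothetical toy graph: two cliques of `size` nodes joined by one edge."""
    adj = {u: set() for u in range(2 * size)}
    for offset in (0, size):
        members = range(offset, offset + size)
        for u in members:
            adj[u].update(v for v in members if v != u)
    adj[size - 1].add(size)  # the single bridging edge
    adj[size].add(size - 1)
    return adj


if __name__ == "__main__":
    graph = two_cliques_with_bridge(4)
    # One crossing edge over a volume of 4 * 3 + 1 = 13 on each side.
    print(cut_conductance(graph, range(4)))
```

Larger cliques joined by the same single edge give a smaller value, which is the situation of a dense component dangling loosely from the rest of the graph.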

[Figure 11 (plots omitted): four random walk traces, on Flickr, Berkeley and Stanford web, NotreDame, and Stanford Web.]

Fig. 11. Random walks on four graphs, each of which has loosely connected components. Each random walk contains $10^4$ steps.

Random walks will mostly stay within one of the components, and the corresponding estimates are the average degree of that component, not of the entire graph. In the NotreDame data, there is a very large star on the right that resembles the graph in Fig. 2, indicating that many nodes are connected only to the centre of the star. When a random walk is trapped inside this star, the estimated average degree will be around two, as shown in the example in Section 3.1, no matter what the true value is.

Given this relationship between RW and RE sampling, our second observation is that RW outperforms RN only when (1) RE outperforms RN (or the degree variance is large), and (2) there are no loosely connected components (or the conductance is not very small). When RE is worse than RN, RW will also be worse than RN. In that case there is no need to test the RW method or to improve RW with its various extensions. When there are loosely connected components, we can modify the simple random walk sampling so that it approximates RE sampling, e.g., by uniform random restart.
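The effect of a trapped walk, and of uniform random restarts, can be illustrated on a toy graph. The sketch below is an assumption-laden illustration, not the method evaluated in this paper: it uses one common restart formulation, in the spirit of Avrachenkov et al. (2010), where a walk at a node of degree $d$ jumps to a uniformly random node with probability $w/(d + w)$ and visits are re-weighted by $1/(d + w)$. The restart weight $w$ (`restart_weight`), the toy graph, and all function names are hypothetical. Setting $w = 0$ gives the simple random walk with the harmonic mean estimator.

```python
import random


def two_component_graph():
    """Hypothetical toy graph: a 5-clique (degree 4) and a disconnected
    95-node cycle (degree 2). The true average degree is 2.1."""
    adj = {u: [v for v in range(5) if v != u] for u in range(5)}
    cycle = list(range(5, 100))
    for idx, u in enumerate(cycle):
        adj[u] = [cycle[idx - 1], cycle[(idx + 1) % len(cycle)]]
    return adj


def walk_estimate(adj, start, steps, restart_weight, rng):
    """Average-degree estimate from a random walk with uniform restarts."""
    nodes = list(adj)
    current = start
    weighted_degree_sum = 0.0
    weight_sum = 0.0
    for _ in range(steps):
        d = len(adj[current])
        if rng.random() < restart_weight / (d + restart_weight):
            current = rng.choice(nodes)          # uniform restart
        else:
            current = rng.choice(adj[current])   # ordinary random-walk step
        d = len(adj[current])
        # Nodes are visited in proportion to d + restart_weight, so each
        # visit is weighted by the inverse of that quantity.
        weighted_degree_sum += d / (d + restart_weight)
        weight_sum += 1.0 / (d + restart_weight)
    return weighted_degree_sum / weight_sum


if __name__ == "__main__":
    graph = two_component_graph()
    rng = random.Random(0)
    print("simple walk :", walk_estimate(graph, 0, 10_000, 0.0, rng))
    print("with restart:", walk_estimate(graph, 0, 50_000, 1.0, rng))
```

The simple walk started in the clique never leaves it and reports the clique's average degree, while the restarted walk reaches both components and its estimate approaches the average degree of the whole graph.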


5. Conclusions

The size of the data, compounded by the power law distribution, is changing the landscape of sampling practice. Uniform random sampling is no longer the method of choice. This does not happen until the data size reaches a threshold for a given degree distribution, as illustrated in Fig. 3 Panel D. The gap between RE and RN samplings grows almost linearly with the data size. It can be infinitely large in theory, and is orders of magnitude in observed data. Such a large difference is particularly important for web-based networks, such as online social networks and the deep web, where the sampling process is costly because of network traffic and daily quotas.

It is remarkable that it is uniform random node (RN) sampling that is on the downside of the comparison. In the past, great efforts were devoted to obtaining uniform random samples using methods such as Metropolis–Hastings Random Walk and rejection sampling (Bar-Yossef & Gurevich, 2008). During the sampling process many nodes are visited, examined, and rejected. In the end these precious uniform random samples can be much worse than the samples obtained using low-cost RE or RW methods.

While it is easy to understand that uniform random sampling has a large estimation error for data with large variance, it is not straightforward to see whether RE sampling can reduce the variance for data of various distributions. We show that the variance of the RE estimator has an upper bound $\langle d \rangle^3/n$. This upper bound is derived independently of the data distribution, and is close to the real variance when the data follows a power law distribution. We generate several synthetic scale-free networks to verify and explain our derivations, and use 18 real networks to support our result.

The upper bound of the variance implies that RE reduces the variance of RN sampling when the graph is large. First, the variance ratio between RN and RE samplings is at least $\gamma^2/\langle d \rangle$.
Although the derivation involves several approximations, it is remarkable that the observed RN/RE ratio has a high linear correlation with $\gamma^2/\langle d \rangle$. The Pearson's correlation coefficient is 0.9867 among the 18 real networks we studied. Second, the comparison between RN and RE depends on the values of $\gamma^2$ and $\langle d \rangle$. For a typical scale-free network (the degree vs. frequency power law slope is 2), we show that $\gamma^2$ grows almost linearly with data size for the same data distribution. When the data size is small ($N < 6 \times 10^4$), $\gamma^2$ can be smaller than $\langle d \rangle$. However, when the data size grows, $\gamma^2$ is much larger than $\langle d \rangle$. In other words, RE is much better than RN for large graphs. Empirically we demonstrate that the improvement ratio is greater than a hundred for Twitter, EmailEu, and WikiTalk. In theory, we project that larger improvement ratios can be found.

The dependency on data size may also explain why such variance reduction was not observed in the literature. Variance reduction happens only for very large data, which have become available only recently. If the data is not very large, RE may not be as good as RN even if the data is scale-free.

When RE sampling is not possible, we can use RW to approximate it, in that both methods sample nodes with probability proportional to their size. The difference is that RW is a PPS sampling only asymptotically. Thus the performance of RW sampling differs from data to data. Our experiments show that in general RW sampling performs slightly below RE sampling, as expected, but sometimes it can be much worse, even worse than RN sampling, when there are loosely connected components in the graph, as characterized by graph conductance.

In retrospect, RE sampling is not widely studied, probably because in most real situations nodes are the primary objects: they are represented explicitly, and can be searched, queried, and crawled. In other words, nodes can be sampled in various ways.
Edges, on the other hand, come as secondary objects that reside in nodes, and can be accessed from the nodes only. On the Web, web pages (the nodes) can be sampled using various methods, while the edges are only revealed as a byproduct when we crawl the Web from one page to another. In social networks, we sample people (the nodes), while the relations between people can be accessed from people, not the other way around. In software component networks, classes and objects are represented explicitly, while the relations between the classes can be obtained from those objects or classes.

New developments in the digitalized world are making RE sampling more common. When using random queries to sample documents, long documents are sampled more often. It is a PPS sampling, or RE sampling when we view the queries as edges connecting the documents. When using random messages/emails to sample users, it is an RE sampling for the user network connected by messages/emails. In the semantic web, edges are explicitly represented as RDF triples.

Acknowledgements

The authors thank the anonymous reviewers for their thoughtful suggestions and detailed comments. Without their continuous and constructive input, this paper could not have progressed this far. We would also like to thank Dingding Li and Chase Chance for their comments, and NSERC (Natural Sciences and Engineering Research Council of Canada) for its support of the research.

References

Avrachenkov, K., Ribeiro, B., & Towsley, D. (2010). Improving random walk estimation accuracy with uniform restarts. In Algorithms and models for the webgraph (pp. 98–109). Springer.
Barabási, A., & Albert, R. (1999). Emergence of scaling in random networks. Science, 286(5439), 509–512.
Bar-Yossef, Z., & Gurevich, M. (2008). Random sampling from a search engine's index. Journal of the ACM, 55(5), 1–74.
Broder, A., et al. (2006). Estimating corpus size via queries. In CIKM (pp. 594–603). ACM.
Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., et al. (2000). Graph structure in the web. Computer Networks, 33(1), 309–320.


Callan, J., & Connell, M. (2001). Query-based sampling of text databases. ACM Transactions on Information Systems, 19(2), 97–130.
Dasgupta, A., Kumar, R., & Sivakumar, D. (2012). Social sampling. In SIGKDD (pp. 235–243). ACM.
Feige, U. (2004). On sums of independent random variables with unbounded variance, and estimating the average degree in a graph. In Proceedings of the thirty-sixth annual ACM symposium on theory of computing (pp. 594–603). ACM.
Gjoka, M., Kurant, M., Butts, C., & Markopoulou, A. (2009). A walk in facebook: Uniform sampling of users in online social networks. ArXiv preprint arXiv:0906.0060.
Goldreich, O., & Ron, D. (2004). On estimating the average degree of a graph. Electronic Colloquium on Computational Complexity (ECCC).
Henzinger, M., Heydon, A., Mitzenmacher, M., & Najork, M. (2000). On near-uniform url sampling. Computer Networks, 33(1–6), 295–308.
Jackson, M. (2008). Social and economic networks. Princeton University Press.
Katzir, L., Liberty, E., & Somekh, O. (2011). Estimating sizes of social networks via biased sampling. In WWW (pp. 597–606). ACM.
Kolaczyk, E. D. (2009). Statistical analysis of network data. Springer.
Kurant, M., Butts, C., & Markopoulou, A. (2012). Graph size estimation. ArXiv preprint arXiv:1210.0460.
Kwak, H., Lee, C., Park, H., & Moon, S. (2010). What is twitter, a social network or a news media? In WWW (pp. 591–600). ACM.
Lawrence, S., & Giles, C. (1998). Searching the world wide web. Science, 280(5360), 98–100.
Lee, S., Kim, P., & Jeong, H. (2006). Statistical properties of sampled networks. Physical Review E, 73(1), 016102.
Leskovec, J., & Faloutsos, C. (2006). Sampling from large graphs. In SIGKDD (pp. 631–636). ACM.
Lovász, L. (1993). Random walks on graphs: A survey. Combinatorics, Paul Erdős is Eighty, 2(1), 1–46.
Lu, J., & Li, D. (2010). Estimating deep web data source size by capture–recapture method. Information Retrieval, 13(1), 70–95.
Lu, J., & Li, D. (2012). Sampling online social networks by random walk. In ACM SIGKDD workshop on hot topics in online social networks (pp. 33–40). ACM.
Lu, J., & Li, D. (2013). Bias correction in small sample from big data. IEEE Transactions on Knowledge and Data Engineering, 25(11), 2658–2663.
Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., & Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21, 1087.
Mislove, A., Marcon, M., Gummadi, K., Druschel, P., & Bhattacharjee, B. (2007). Measurement and analysis of online social networks. In SIGCOMM (pp. 29–42). Springer.
Montemurro, M. (2001). Beyond the Zipf–Mandelbrot law in quantitative linguistics. Physica A: Statistical Mechanics and Its Applications, 300(3), 567–578.
Newman, M. E. J. (2005). Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 46, 323.
Newman, M. (2010). Networks: An introduction. Oxford University Press, Inc.
Papagelis, M., Das, G., & Koudas, N. (2013). Sampling online social networks. IEEE Transactions on Knowledge and Data Engineering, 25(3), 662–676.
Rasti, A., et al. (2009). Respondent-driven sampling for characterizing unstructured overlays. In INFOCOM (pp. 2701–2705). IEEE.
Ribeiro, B., & Towsley, D. (2010). Estimating and sampling graphs with multidimensional random walks. In Annual conference on internet measurement (pp. 390–403). ACM.
Salganik, M., & Heckathorn, D. (2004). Sampling and estimation in hidden populations using respondent-driven sampling. Sociological Methodology, 34(1), 193–240.
Si, L., & Callan, J. (2003). Relevant document distribution estimation method for resource selection. In Proceedings of the 26th annual international ACM SIGIR conference on research and development in information retrieval (pp. 298–305). Toronto, Canada: ACM.
Sinclair, A., & Jerrum, M. (1988). Conductance and the rapid mixing property for Markov chains: The approximation of the permanent resolved. In Proc. 20th ACM STOC (pp. 235–244).
Stumpf, M., & Wiuf, C. (2005). Sampling properties of random graphs: The degree distribution. Physical Review E, 72(3), 036118.
Stumpf, M., Wiuf, C., & May, R. (2005). Subnets of scale-free networks are not scale-free: Sampling properties of networks. PNAS, 102(12), 4221.
Viswanath, B., Mislove, A., Cha, M., & Gummadi, K. P. (2009). On the evolution of user interaction in facebook. In Proceedings of the 2nd ACM SIGCOMM workshop on social networks (WOSN'09).
Wang, T., Chen, Y., Zhang, Z., Xu, T., Jin, L., Hui, P., et al. (2011). Understanding graph sampling algorithms for social network analysis. In 2011 31st International conference on distributed computing systems workshops (ICDCSW) (pp. 123–128). IEEE.
Wang, Y., Liang, J., & Lu, J. (in press). Discover hidden web properties by random walk on bipartite graph. Information retrieval (pp. 27). Springer.
Wejnert, C., & Heckathorn, D. (2008). Web-based network sampling. Sociological Methods & Research, 37(1), 105–134.
Wilson, C., Boe, B., Sala, A., Puttaswamy, K., & Zhao, B. (2009). User interactions in social networks and their implications. In Proceedings of the 4th ACM European conference on computer systems (pp. 205–218). ACM.