Analysis of hybrid P2P overlay network topology


Available online at www.sciencedirect.com

Computer Communications 31 (2008) 190–200 www.elsevier.com/locate/comcom

Chao Xie a,*, Guihai Chen c, Art Vandenberg d, Yi Pan b,*

a Department of Computer Science, University of Wisconsin-Madison, Madison, WI 53706-1685, USA
b Department of Computer Science, Georgia State University, Atlanta, GA 30302-3994, USA
c State Key Laboratory of Novel Software, Nanjing University, Nanjing 210093, China
d Department of Information Systems and Technology, Georgia State University, Atlanta, GA 30302-3968, USA

Available online 19 August 2007

Abstract

Modeling peer-to-peer (P2P) networks is a challenge for P2P researchers. In this paper, we provide a detailed analysis of large-scale hybrid P2P overlay network topology, using Gnutella as a case study. First, we re-examine the power-law distributions of the Gnutella network discovered by previous researchers. Our results show that the current Gnutella network deviates from the earlier power-laws, suggesting that the Gnutella network topology may have evolved considerably over time. Second, we identify important trends with regard to the evolution of the Gnutella network between September 2005 and February 2006. Upon analyzing the limitations of the power-laws, we provide a novel two-layered approach to study the topology of the Gnutella network. We divide the Gnutella network into two layers, namely the mesh and the forest, to model the hybrid and highly dynamic architecture of the current Gnutella network. We give a detailed analysis of the two-layered overlay and present six power-laws and one empirical law to characterize the topology. Using the two-layered approach and the proposed laws, realistic topologies can be generated and the realism of artificial topologies can be validated.

© 2007 Elsevier B.V. All rights reserved.

Keywords: Peer-to-peer; Overlay network; Network topology; Power-law

This paper extends and supplants the earlier version of this paper presented at IEEE GLOBECOM'06 [1].

Guihai Chen's work is supported by China NSF under Grant 60573131, China Jiangsu Provincial NSF under Grant BK2005208, China 973 projects under Grants 2006CB303000 and 2002CB312002, and Nokia Bridging the World Program. Yi Pan's work is supported in part by the National Science Foundation (NSF) under Grants ECS-0196569, ECS-0334813, and CCF-0514750. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the NSF, China NSF or Nokia.

* Corresponding authors. Tel.: +1 404 651 0649; fax: +1 404 463 9912.
URLs: http://www.cs.wisc.edu/~cxie (C. Xie), http://www.cs.gsu.edu/ pan (Y. Pan).

0140-3664/$ - see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.comcom.2007.08.014

1. Introduction

Modeling the topologies of peer-to-peer (P2P) networks is an important open problem. An accurate topological model can have significant influence on P2P research. First, we can gain detailed insight into the nature of the underlying system. Second, the model can enable detailed analysis of algorithms and facilitate design of more efficient protocols that take advantage of topology properties. Third, we can generate more accurate artificial topologies for simulation purposes. Furthermore, we can predict future trends and thereby address potential problems in advance.

Previous researchers [2,7] tended to use power-laws to characterize the topology of P2P networks. Recent advances in P2P networks have resulted in hybrid architectures, represented by the success of Gnutella protocol 0.6 [3] and Kazaa [4]. In this paper, we provide a detailed analysis of large-scale hybrid P2P network topology, giving results concerning major topology properties and main distributions. We choose Gnutella as a case study, as it has a large user community and an open architecture.

Our work can be summarized by the following points. First, we re-examine the power-law distributions of the Gnutella network discovered by previous researchers. Our results show that the current Gnutella network deviates from the earlier power-laws. This observation suggests that the Gnutella network topology may have evolved considerably over time.

C. Xie et al. / Computer Communications 31 (2008) 190–200

Second, we identify important trends with regard to the evolution of the Gnutella network between September 2005 and February 2006. As our primary contribution, we provide a novel two-layered approach to study the topology of the Gnutella network. Due to the limitations of the power-laws, we divide the Gnutella network into two layers, namely the mesh and the forest, to model the hybrid and highly dynamic architecture of the current Gnutella network. We give a detailed analysis of the two-layered overlay and present six power-laws and one empirical law to characterize the topology. Finally, we focus on the generation of realistic topologies and the validation of artificial topologies using our approach and the proposed laws.

The rest of this paper is organized as follows. Section 2 presents background and previous work. In Section 3, we present our traces of the Gnutella network. In Section 4, we re-examine the power-law distributions discovered by previous researchers and identify the trends concerning the evolution of the Gnutella network. In Section 5, we analyze the limitations of the power-laws and introduce our new two-layered approach to study the topology of the Gnutella network. In Section 6, we analyze the topological properties of the mesh and present the power-laws concerning the mesh topology. In Section 7, we examine the topology properties of the forest and provide one empirical law concerning the tree size. In Section 8, we present two power-laws concerning the overlay network as a whole and discuss the practical uses of our approach and laws. Finally, Section 9 concludes our work.


2. Background and previous work

2.1. Gnutella protocol

Gnutella protocol 0.4 [5] employs a purely decentralized model. In this model, individual nodes, also called servents, are equal in terms of functionality. They not only perform server-side roles, such as matching incoming queries against their local resources and responding with applicable results, but also offer client-side functions, such as issuing queries and collecting search results. All servents are connected to each other randomly. Fig. 1 illustrates the topology of the Gnutella 0.4 network.

Fig. 1. Topology of the Gnutella 0.4 Network.

Gnutella protocol 0.6 [3] employs a hybrid architecture combining the centralized and decentralized models. Servents are categorized into leaves and ultrapeers. A leaf keeps only a small number of connections to ultrapeers. An ultrapeer maintains connections with other ultrapeers and acts as a proxy to the Gnutella network for the leaves connected to it. An ultrapeer only forwards a query to a leaf if it believes the leaf can answer it, and leaves never relay queries between ultrapeers. Fig. 2 illustrates the topology of the Gnutella 0.6 network. Protocol 0.6 is compatible with protocol 0.4, which implies that the current Gnutella network can contain some fraction of nodes that still follow the former protocol specification 0.4.

Fig. 2. Topology of the Gnutella 0.6 Network.

2.2. Power-law

Power-laws have been found in numerous diverse fields spanning sociological, geological, natural and biological systems. Power-laws of the form $y \propto x^{a}$ enable a compact characterization of topologies through their exponents. Faloutsos et al. [8] discovered four power-laws characterizing the topology of the Internet, while Magoni et al. [9] found another four power-laws of the Internet. In [2,7,11], several power-laws were found with regard to the topology of the Gnutella network. In 2002, Ripeanu et al. [10] argued that the connection distribution of the more recent Gnutella network may follow a two-tier power-law distribution. P2P studies usually assume that these power-laws characterize the topology of P2P networks and use synthetically generated topologies following these power-laws [12–17].

3. Our Gnutella network traces and the crawler

We developed a crawler to collect topology information of the Gnutella network, taking advantage of the message communication mechanisms of both protocol 0.4 and protocol 0.6. The crawler is based on the Limewire [6] open source client and performs a parallel breadth-first search of the network. It can discover more than 100,000 nodes in minutes. We build the graph of nodes by analyzing the data collected from the Gnutella network. We model two adjacent nodes that have at least one connection between each other by an edge. We treat the Gnutella network as an undirected graph.

In this paper, we provide two traces of the Gnutella network, namely the 091505 trace and the 021106 trace. Note that we have studied the topology of the Gnutella network from September 2005 until February 2006, and all the traces we collected agree with the results given in this paper. In Table 1, we present some basic statistics about our traces and previous work [2,11]. In Table 1, l represents the average shortest distance and k represents the average degree.

Table 1. Basic Statistics of the Gnutella Network

Source       Ours      Ours      [11]      [11]      [2]       [2]
Stat. Data   091505    021106    V34206    V57926
Time         09–2005   02–2006   09–2003   10–2003   11–2000   12–2000
Nodes        107,205   118,925   34,206    57,926    992       1,125
Edges        118,187   130,612   43,958    80,276    2,465     4,080
l            6.4       7.9       5.4       5.8       3.7       3.3
Diam.        22        24        16        15        9         8
k            2.20      2.20      2.57      2.72      4.97      7.25

4. Current Gnutella network topology

In this section, we examine the power-laws of the Gnutella network described in the previous literature against our two traces. The goal is to find out whether the topology of the current Gnutella network agrees with the early power-laws.

We use linear regression to fit a line to a set of two-dimensional points using the least-squares method. The validity of the approximation is quantified by the correlation coefficient, which ranges from -1.0 to 1.0. The absolute value of the correlation coefficient is denoted ACC. An ACC value of 1.0 indicates perfect linear correlation. In general, the ACC should be greater than 0.90 to validate linear correlation.

4.1. Rank distribution

In this section, we study the degrees of the nodes in the Gnutella network.

Power-law of rank exponent R: The degree $d_v$ of a node $v$ is proportional to the rank of the node $r_v$ to the power of a constant $R$:

$$d_v \propto r_v^{R}$$

The rank $r_v$ of a node $v$ is defined as its index in the order of decreasing degree. Jovanovic [2] found that the early Gnutella network followed the above power-law with rank exponent of 0.98 and ACC of 0.94. For our two traces, the rank exponent is 0.64268 and 0.60681 and the ACC is 0.92178 and 0.88120 in chronological order, as we see in Fig. 3. The low ACC values imply that this power-law is relatively weak in the 091505 graph and even invalid for the 021106 graph.
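The fitting procedure described above (least-squares regression on log–log points, judged by the ACC) can be sketched as follows. This is a minimal illustration and not the authors' tooling: the function names `loglog_fit` and `rank_exponent` are hypothetical, base-10 logarithms are an assumption, and the input is assumed to be a plain list of positive node degrees. Note that for degrees sorted in decreasing order the fitted slope comes out negative.

```python
import math

def loglog_fit(xs, ys):
    """Least-squares line through (log10 x, log10 y); returns slope, intercept, ACC."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    syy = sum((b - mean_y) ** 2 for b in ly)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # ACC is the absolute value of the Pearson correlation coefficient.
    acc = abs(sxy) / math.sqrt(sxx * syy)
    return slope, intercept, acc

def rank_exponent(degrees):
    """Fit degree against rank, with nodes ranked by decreasing degree (rank 1 first)."""
    ordered = sorted((d for d in degrees if d > 0), reverse=True)
    ranks = range(1, len(ordered) + 1)
    return loglog_fit(ranks, ordered)
```

Under these assumptions, a trace would count as following the rank power-law only if the returned ACC exceeds the 0.90 threshold mentioned above.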
Compared with a pure power-law distribution, the two graphs deviate from the linear regression with similar

Fig. 3. Log–log plot of the degree $d_v$ versus the rank $r_v$ in the sequence of decreasing degree.

patterns. On the one hand, the nodes with high rank have too small a degree. This is because Gnutella protocol 0.6 imposes a limit on the maximum number of connections of an ultrapeer. On the other hand, there are too many nodes with degree around 30, so the curve breaks away from the regression line. This pattern suggests that ultrapeers in the Gnutella 0.6 network tend to have a connection limit of around 30.

Moreover, the 021106 graph is somewhat different from the 091505 graph. First, the nodes with high rank in the former graph have smaller degrees than their counterparts in the latter, implying that protocol 0.6 is effectively replacing protocol 0.4. Second, the curve beyond a degree of approximately 30 drops much more suddenly in the former graph than in the latter, which suggests that ultrapeers tend to use as many connections as they can.

4.2. Degree distribution

In this section, we study the distribution of the degrees of the nodes. Note that the degree power-law we present in the current work is different from the one in earlier work [2]. However, they both refer to the same distribution. The


difference is that the current work uses the cumulative probability distribution function, while the earlier work uses the probability distribution function. As a result, the exponents of the two power-laws differ approximately by one. The cumulative distribution is preferable because it can be estimated in a statistically robust way.

Power-law of degree exponent D: The complementary cumulative distribution function (CCDF) $D_d$ of a degree $d$ is proportional to the degree to the power of a constant $D$:

$$D_d \propto d^{D}$$

The CCDF of a degree $d$ is the percentage of nodes that have degree greater than $d$. Jovanovic [2] showed a degree exponent of 1.4 and ACC of 0.96 for the early Gnutella network, using the probability distribution function. For our two traces, the degree exponent is 2.25926 and 2.31074 and the ACC is 0.91744 and 0.87718 in chronological order, as we see in Fig. 4. Again, the low ACC values imply that this power-law is relatively weak in the 091505 graph and even invalid for the 021106 graph.

Compared with a pure power-law distribution, the graphs share some common patterns. There are too many nodes with degree around 30, and the resulting curves deviate from the linear regression. This coincides with what we found in the rank distribution.
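To make the CCDF definition concrete, the sketch below computes, for each observed degree, the fraction of nodes whose degree is strictly greater. The function name `degree_ccdf` and the list-of-degrees input are illustrative assumptions. The "differ by one" remark follows from integration: a density proportional to $d^{-a}$ with $a > 1$ has a tail proportional to $d^{-(a-1)}$.

```python
from collections import Counter

def degree_ccdf(degrees):
    """Map each observed degree d to the fraction of nodes with degree > d."""
    total = len(degrees)
    counts = Counter(degrees)
    remaining = total
    ccdf = {}
    for d in sorted(counts):
        # After removing the nodes of degree d, only strictly larger degrees remain.
        remaining -= counts[d]
        ccdf[d] = remaining / total
    return ccdf
```

The largest degree always maps to zero, so that point has to be dropped before taking logarithms for a log–log plot.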


Furthermore, in the 021106 graph, degrees in the interval 5–20 follow an almost constant distribution, which means there are too few ultrapeers with a degree in this interval. This confirms our previous conclusion that ultrapeers try to hold as many connections as the limit allows. The curve at higher degrees in the 021106 graph drops much more sharply, which agrees with our previous comment that Gnutella protocol 0.6 prevents ultrapeers from using a large number of connections.

Fig. 4. Log–log plot of $D_d$ versus the degree $d$.

5. The two-layered approach

In this section, we first discuss the limitations of the power-laws and then present a new approach to study the topology of the Gnutella network.

5.1. Limitations of the power-laws

Previous studies [18,19] suggest two key causes for power-law distributions in network topologies: incremental growth and preferential connectivity. Incremental growth refers to open networks that form by the continual addition of new nodes, and thus the gradual increase in the size of the network. Preferential connectivity refers to the tendency of a new node to connect to existing nodes that are highly connected or popular.

The topology of the Gnutella network is highly dynamic, since a node can join or leave the Gnutella network at any time. More specifically, most leaves tend to disconnect from the Gnutella network within several minutes of connecting to it. The transient lifetime of the leaves works against incremental growth. Moreover, due to the hybrid architecture of Gnutella protocol 0.6 [3], a leaf keeps only a small number of connections to ultrapeers and cannot connect to other leaves. This limitation on leaves also works against preferential connectivity, because leaves can never become highly connected.

Combining the above factors, we can explain why the current Gnutella network does not follow the early power-law distributions. These limitations make the power-laws inappropriate for modeling hybrid and highly dynamic topologies.

As we mentioned earlier, P2P studies usually use synthetically generated topologies characterized by the early power-laws. These topologies may not reflect the properties of current P2P networks, so a new approach is needed to model current P2P networks.

5.2. Our approach

In our study, we propose a new two-layered approach to model the topology of the current Gnutella network. We split the Gnutella network into two layers, namely the mesh and the forest. Before we present the analysis of our approach, we provide below a few definitions. Note that Magoni et al. [9] proposed some definitions to describe the AS network.


We keep these definitions and modify them into the following ones. Fig. 5 shows different kinds of nodes in a sample graph.

• Cycle node: a node that belongs to a cycle (i.e. it is on a closed path of disjoint nodes; in Fig. 5, there are eleven cycle nodes).
• Bridge node: a node which is not a cycle node and is on a path connecting 2 cycle nodes (in Fig. 5, there is one bridge node).
• In-mesh node: a node which is a cycle node or a bridge node (in Fig. 5, the mesh has twelve in-mesh nodes).
• In-tree node: a node which is not an in-mesh node (i.e. it belongs to a tree; in Fig. 5, each tree has four in-tree nodes).

Mesh is the set of in-mesh nodes and forest is the set of in-tree nodes.

• Branch node: an in-tree node of degree at least 2.
• Leaf node: an in-tree node of degree 1.
• Root node: an in-mesh node which is the root of a tree.
• Relay node: a node having exactly 2 connections.
• Border node: a node located on the diameter of the network.
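One way to compute the split defined above is to repeatedly prune nodes of degree at most one; what survives is what graph theory calls the 2-core. The sketch below assumes that the mesh (cycle nodes plus bridge nodes) coincides with this 2-core, which is our reading of the definitions rather than a procedure stated in the paper. The function name `split_mesh_forest` and the edge-list input are hypothetical.

```python
from collections import deque

def split_mesh_forest(edges):
    """Split an undirected graph into (mesh, forest, roots) by pruning leaves.

    Assumption: the mesh equals the 2-core, i.e. the nodes left after
    repeatedly removing every node whose remaining degree is at most one.
    """
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    degree = {node: len(nbrs) for node, nbrs in adj.items()}
    queue = deque(node for node, d in degree.items() if d <= 1)
    forest = set()
    while queue:
        node = queue.popleft()
        if node in forest:
            continue  # a node can be queued more than once
        forest.add(node)
        for nbr in adj[node]:
            if nbr not in forest:
                degree[nbr] -= 1
                if degree[nbr] <= 1:
                    queue.append(nbr)

    mesh = set(adj) - forest
    # Root nodes: in-mesh nodes with at least one in-tree neighbour.
    roots = {node for node in mesh if any(nbr in forest for nbr in adj[node])}
    return mesh, forest, roots
```

A component with no cycle at all is pruned away entirely, so all of its nodes end up in the forest, consistent with the definition of an in-tree node as any node that is not in the mesh.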

If we split the Gnutella network into the mesh and the forest, we can analyze the topological properties of the mesh and the forest separately. Comparing Figs. 2 and 5, we find that the mesh in Fig. 5 is composed only of ultrapeers and acts as the backbone of the Gnutella network. Since ultrapeers are relatively stable and tend to stay in the Gnutella network for a longer time, the mesh can meet the requirement of incremental growth. Furthermore, since ultrapeers can connect to other ultrapeers, the mesh can meet the requirement of preferential connectivity. Hence, the topology of the mesh should in theory comply with power-laws (see Section 6 for detailed validation). On the other hand, we can also obtain the major topology properties and distributions of the forest (see Section 7). Note that it is not necessary to have all ultrapeers in the mesh.

With the knowledge of both the topology of the mesh and the topology of the forest, we can model the topology of the Gnutella network easily by merging these two layers.

6. Mesh topology analysis

In this section, we study the topology properties concerning the mesh in the Gnutella network. In Table 2, we present some basic statistics about the mesh in our traces. In Table 2, p(m) represents the percentage of nodes in the mesh, l represents the average shortest distance, and k represents the average degree.

6.1. Mesh node rank exponent $R_m$

In this section, we study the degrees of the nodes in the mesh. We sort the nodes in the mesh in decreasing order of degree $d_{v_m}$ and define the mesh node rank $r_{v_m}$ as the index of the node in the sequence. We plot the $(d_{v_m}, r_{v_m})$ pairs in log–log scale. The plots are shown in Fig. 6. The data values are represented by points, while the solid lines represent the least-squares approximation. The points of Fig. 6 are well approximated by the linear regression. The ACC is 0.96425 for the 091505 trace and 0.96580 for the 021106 trace. This leads us to the following power-law and definition.

Power-law 1 (Mesh node rank exponent): The degree $d_{v_m}$ of a mesh node $v_m$ is proportional to the rank of the mesh node $r_{v_m}$ to the power of a constant $R_m$:

$$d_{v_m} \propto r_{v_m}^{R_m}$$

Definition 1. Let us sort the mesh nodes of a graph in decreasing order of degree. We define the mesh rank exponent $R_m$ to be the slope of the plot of the degrees of the mesh nodes versus the rank of the nodes in log–log scale.

Fig. 5. Different kinds of nodes.

Table 2. Basic Statistics of the Mesh

Stat. Data     091505    021106
Nb of Nodes    16,487    11,852
p(m)           15.4%     10.0%
Nb of Edges    27,467    23,539
l              5.2       6.5
Diameter       14        17
k              3.33      3.97

Fig. 6. Log–log plot of the mesh node degree $d_{v_m}$ versus the rank $r_{v_m}$ in the sequence of decreasing degree.

6.2. Mesh node degree exponent $O_m$

In this section, we study the distribution of the degrees of the nodes in the mesh. We define the frequency $f_{d_m}$ of a mesh node degree $d_m$ as the number of nodes in the mesh with degree $d_m$. We plot the $(f_{d_m}, d_m)$ pairs in log–log scale in Fig. 7. In these plots, we exclude a small percentage of nodes of higher degree that have a frequency of one, but still plot 99.9% of the total number of nodes. As we saw earlier, the higher degrees are described and captured by the mesh rank exponent. The major observation of Fig. 7 is that the plots are approximately linear, with ACC of 0.97171 for the 091505 trace and 0.96016 for the 021106 trace. We infer the following power-law and definition.

Power-law 2 (Mesh node degree exponent): The frequency $f_{d_m}$ of a mesh node degree $d_m$ is proportional to the degree to the power of a constant $O_m$:

$$f_{d_m} \propto d_m^{O_m}$$

Definition 2. We define the mesh node degree exponent $O_m$ to be the slope of the plot of the frequency of the mesh node degrees versus the degrees in log–log scale.
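The points behind such a frequency plot could be prepared as sketched below. The rule used here, dropping the highest degrees for as long as their frequency is one, is only one possible reading of the exclusion described in this subsection, and the function name `degree_frequencies` is hypothetical. The returned coverage lets the caller check what share of nodes is still plotted (the text reports 99.9%).

```python
from collections import Counter

def degree_frequencies(degrees):
    """Return ([(degree, frequency), ...], coverage) with the singleton tail removed.

    Assumed rule: strip the largest degrees while they occur exactly once.
    `degrees` must be a non-empty list of mesh node degrees.
    """
    freq = Counter(degrees)
    kept = sorted(freq)
    while kept and freq[kept[-1]] == 1:
        kept.pop()
    points = [(d, freq[d]) for d in kept]
    coverage = sum(f for _, f in points) / len(degrees)
    return points, coverage
```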

Fig. 7. Log–log plot of frequency $f_{d_m}$ versus the mesh node degree $d_m$.

6.3. Mesh pair rank exponent $P_m$

In this section, we study the Number of distinct Shortest Paths (NSP) of each pair of vertices in the mesh. The number of distinct shortest paths between two vertices is the number of shortest paths such that any of these paths have at least one vertex not in common [9]. The distribution of NSP is useful for evaluating the amount of redundant edges involved in shortest paths. Higher NSP values mean that if one edge of a shortest path between a pair of nodes is removed, there is still a probability that another shortest path of the same length exists for this pair.

We sort the pairs of in-mesh nodes in decreasing NSP $n_{p_m}$ and define the pair rank $r_{p_m}$ as the index of the pair in the sequence. We plot the $(n_{p_m}, r_{p_m})$ pairs in log–log scale. The plots are shown in Fig. 8. Due to the enormous number of node pairs, we plot the first $10^6$ pairs only. The points of Fig. 8 are well approximated by the linear regression, with ACC of 0.99157 for the 091505 trace and 0.99632 for the 021106 trace. Note that in Fig. 8(a) a significant portion of the upper left part of the curve seems to go off the straight line. However, this is a visual illusion. The dots in the lower right part of the curve are much denser than the dots in the upper left part, so the ACC value remains high. This leads us to the following power-law and definition.

Power-law 3 (Mesh pair rank exponent): The NSP $n_{p_m}$ between a pair of mesh nodes $p_m$ is proportional to the rank of the pair $r_{p_m}$ to the power of a constant $P_m$:

$$n_{p_m} \propto r_{p_m}^{P_m}$$
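The NSP values used in this subsection can be obtained with a breadth-first search that counts shortest paths as it expands. The sketch below is illustrative only: it assumes an undirected, unweighted graph stored as a dict of neighbour sets, assumes node identifiers are mutually comparable (so each unordered pair is reported once), and uses hypothetical function names. In a simple graph, two different shortest paths between the same endpoints necessarily differ in at least one vertex, so counting all shortest paths matches the definition above.

```python
from collections import deque

def shortest_path_counts(adj, source):
    """BFS from `source`; returns (dist, count) with count[v] = number of shortest paths."""
    dist = {source: 0}
    count = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                # First time v is reached: it inherits u's path count.
                dist[v] = dist[u] + 1
                count[v] = count[u]
                queue.append(v)
            elif dist[v] == dist[u] + 1:
                # Another shortest route into v, via u.
                count[v] += count[u]
    return dist, count

def nsp_ranked(adj):
    """NSP of every connected unordered pair, sorted in decreasing order (rank 1 first)."""
    values = []
    for s in adj:
        _, count = shortest_path_counts(adj, s)
        values.extend(c for t, c in count.items() if s < t)
    return sorted(values, reverse=True)
```

Running one search per node costs on the order of nodes times edges, which is why plotting only the top-ranked pairs, as done above, is a practical choice for large traces.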


Fig. 8. Log–log plot of the mesh NSP $n_{p_m}$ versus the rank $r_{p_m}$ in the sequence of decreasing NSP.

Definition 3. Let us sort the pairs of nodes in the mesh of a graph in decreasing order of NSP. We define the mesh pair rank exponent $P_m$ to be the slope of the plot of the NSP versus the rank of the mesh node pairs in log–log scale.

6.4. Mesh NSP exponent $N_m$

In this section, we study the distribution of NSP of in-mesh nodes. We define the frequency $f_{n_m}$ of an NSP $n_m$ as the number of pairs with NSP of $n_m$ in the mesh. We plot the $(f_{n_m}, n_m)$ pairs in log–log scale in Fig. 9. In these plots, we exclude a small percentage of pairs of higher NSP that have the lowest frequency, but still plot more than 99.9% of the total number of pairs. The solid lines are the result of the linear regression. The major observation of Fig. 9 is that the plots are approximately linear, with ACC of 0.94301 for the 091505 trace and 0.99840 for the 021106 trace. We infer the following power-law and definition.

Power-law 4 (Mesh NSP exponent): The frequency $f_{n_m}$ of an NSP $n_m$ between a pair of nodes in the mesh is proportional to the NSP to the power of a constant $N_m$:

$$f_{n_m} \propto n_m^{N_m}$$

Fig. 9. Log–log plot of frequency $f_{n_m}$ versus the mesh NSP $n_m$.

Definition 4. We define the mesh NSP exponent $N_m$ to be the slope of the plot of the frequency of the mesh NSP versus the mesh NSP in log–log scale.

7. Forest topology analysis

In this section, we study the topology properties concerning the forest in the Gnutella network. In Table 3, we present some basic statistics about the forest in our traces. In Table 3, p(t) represents the percentage of nodes in the forest.

7.1. Tree depth distribution

We define the probability $p(t_d)$ of a tree depth $t_d$ as the percentage of trees in the forest with depth $t_d$. Fig. 10 describes the tree depth distribution.

Table 3. Basic Statistics of the Forest

Stat. Data        091505    021106
Nb of Nodes       90,718    107,073
p(t)              84.6%     90.0%
Nb of trees       9,886     6,830
Mean tree size    10.18     16.68
Max tree size     4,824     231
Mean tree depth   1.52      1.30
Max tree depth    8         10
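Tree statistics such as those in Table 3 could be computed along the following lines once the mesh is known. This sketch assumes one tree per root node (all in-tree nodes hanging off that root), counts the root in the tree size as Section 7.2 does, and measures depth as the longest distance from the root to a node of its tree. The function name `tree_stats` and the dict-of-sets graph representation are hypothetical.

```python
from collections import deque

def tree_stats(adj, mesh):
    """Return {root: (size, depth)} for every mesh node that has in-tree neighbours.

    adj  : dict mapping each node to the set of its neighbours
    mesh : set of in-mesh nodes; every other node is treated as an in-tree node
    """
    stats = {}
    for root in mesh:
        frontier = deque((nbr, 1) for nbr in adj[root] if nbr not in mesh)
        if not frontier:
            continue  # this mesh node is not the root of any tree
        seen = {root}
        size, depth = 1, 0  # the size counts the root itself
        while frontier:
            node, d = frontier.popleft()
            if node in seen:
                continue
            seen.add(node)
            size += 1
            depth = max(depth, d)
            for nbr in adj[node]:
                if nbr not in seen and nbr not in mesh:
                    frontier.append((nbr, d + 1))
        stats[root] = (size, depth)
    return stats
```

Under this convention the mean tree size equals the number of in-tree nodes divided by the number of trees, plus one for the root, which agrees with the figures in Table 3 (for example 90,718 / 9,886 + 1 ≈ 10.18).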


Fig. 10. Tree depth distribution.

In Fig. 10, we notice that more than 56% of trees are simply composed of leaves that are directly connected to their corresponding root. We can also observe that more than 27% of trees have depth 2 and less than 4% of trees have depth larger than 3.

7.2. Tree rank distribution

In this section, we study the size of each tree, which is defined as the number of vertices composing the tree plus the root. We sort the trees in decreasing tree size $s_t$ and define the tree rank $r_t$ as the index of the tree in the sequence. We plot the $(s_t, r_t)$ pairs in Fig. 11, applying log-scale only on the y-axis. The solid lines are given by linear regression. The plots of Fig. 11 match the linear regression line. The ACC is 0.95621 for the 091505 trace and 0.95465 for the 021106 trace. Consequently, we infer the following empirical law and definition.

Empirical law 1: The size $s_t$ of a tree $t$ is proportional to an exponential function whose exponent is the product of the rank of the tree $r_t$ and a constant $T$:

$$s_t \propto \exp(T\, r_t)$$
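The semi-log fit behind this empirical law can be sketched as a regression of the logarithm of tree size on rank. The sketch assumes natural logarithms, so that the fitted slope can be read directly as $T$ in $s_t \propto \exp(T\, r_t)$; with a base-10 axis the slope would differ by a constant factor. The function name `tree_size_exponent` is hypothetical.

```python
import math

def tree_size_exponent(sizes):
    """Fit ln(size) against rank for trees sorted by decreasing size.

    Returns (T, acc): the slope of the semi-log regression line and the
    absolute value of the correlation coefficient.
    """
    ordered = sorted(sizes, reverse=True)
    ranks = list(range(1, len(ordered) + 1))
    logs = [math.log(s) for s in ordered]
    n = len(ranks)
    mean_r, mean_l = sum(ranks) / n, sum(logs) / n
    srr = sum((r - mean_r) ** 2 for r in ranks)
    sll = sum((v - mean_l) ** 2 for v in logs)
    srl = sum((r - mean_r) * (v - mean_l) for r, v in zip(ranks, logs))
    slope = srl / srr
    acc = abs(srl) / math.sqrt(srr * sll)
    return slope, acc
```

Because sizes are sorted in decreasing order, the fitted $T$ is negative whenever the sizes are not all equal.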

Definition 5. Let us sort the trees of a graph in decreasing order of size. We define $T$ to be the slope of the plot of the sizes of trees versus the rank of the trees with log-scale applied on the sizes of trees.

This empirical law provides a formula for the sizes of trees in a ranked sequence of trees.

8. Discussion

In this section, we first present two more power-laws concerning all the nodes (including both in-mesh nodes and in-tree nodes) in the Gnutella network. Then we focus on the generation of synthetic topologies of P2P networks.

Fig. 11. Plot of the tree size $s_t$ (log-scale) versus the rank $r_t$ in the sequence of decreasing size.

8.1. Additional power-laws

In our study, we find that the NSP rank distribution and the NSP distribution of all the nodes in the Gnutella network follow power-laws as well. This is easily explained: because the mesh is the core part of the network, shortest paths are made up mainly of nodes in the mesh, while nodes in the forest barely contribute to shortest paths. Nevertheless, the two power-laws presented below could be used as minor metrics to distinguish P2P topologies.

8.1.1. Pair rank exponent $P$

Here we study the NSP of all the nodes (including both in-mesh nodes and in-tree nodes). We sort the pairs of nodes in decreasing NSP $n_p$ and plot the $(n_p, r_p)$ pairs in log–log scale in Fig. 12, where $r_p$ is the rank of the pair in the sequence. Due to the enormous number of node pairs, we plot the first $10^6$ pairs only. The data values are represented by points, while the solid lines represent the least-squares approximation. The points of Fig. 12 are well approximated by the linear regression, with ACC of 0.98184 for the 091505 trace and 0.99259 for the 021106 trace. Note that in both Fig. 12(a) and (b), a significant portion of the upper left part of the curves seems to go off the straight line. However, this is again a visual illusion. The dots in the lower right part of the curve are much denser than the dots in the upper left part, so the ACC value remains high. This leads us to the following power-law and definition.

Fig. 12. Log–log plot of the NSP $n_p$ versus the rank of the pairs $r_p$ in the sequence of decreasing NSP.

Power-law 5 (Pair rank exponent): The NSP $n_p$ between a pair of nodes $p$ is proportional to the rank of the pair $r_p$ to the power of a constant $P$:

$$n_p \propto r_p^{P}$$

Definition 6. Let us sort the pairs of nodes of a graph in decreasing order of NSP. We define the pair rank exponent $P$ to be the slope of the plot of the NSP versus the rank of the pairs in log–log scale.

8.1.2. NSP exponent $N$

Here we study the distribution of NSP of all the nodes (including both in-mesh nodes and in-tree nodes). We define the frequency $f_n$ of an NSP $n$ as the number of pairs with NSP of $n$. We plot the $(f_n, n)$ pairs in log–log scale in Fig. 13. In these plots, we exclude a small percentage of pairs of higher NSP that have the lowest frequency. In any case, we plot more than 99.9% of the total number of pairs. The solid lines are the result of the linear regression. The major observation is that the plots are approximately linear, with ACC of 0.93510 for the 091505 trace and 0.98810 for the 021106 trace. We infer the following power-law and definition.

Fig. 13. Log–log plot of frequency $f_n$ versus the NSP $n$.

Power-law 6 (NSP exponent): The frequency $f_n$ of an NSP $n$ between a pair of nodes is proportional to the NSP to the power of a constant $N$:

$$f_n \propto n^{N}$$

Definition 7. We define the NSP exponent $N$ to be the slope of the plot of the frequency of the NSP versus the NSP in log–log scale.

8.2. Topology generation The regularity observed in our traces of the Gnutella network between September 2005 and February 2006 (including but not restricted to the two traces specifically discussed in this paper) is unlikely to be a coincidence.

C. Xie et al. / Computer Communications 31 (2008) 190–200

We could reasonably conjecture that our laws might continue to hold, at least for the near future. Our work can facilitate the generation of realistic topologies of P2P networks, especially those that employ a hybrid and highly dynamic architecture like the Gnutella network.

As an overview, we list the following guidelines for creating P2P network topologies. First, a small percentage of the nodes (15.4% or 10.0%) belong to the mesh and a large percentage of the nodes (84.6% or 90.0%) belong to the forest. Second, the degree distribution of the mesh is skewed, following our power-laws 1 and 2. Third, more than 56% of the trees have depth one, fewer than 4% of the trees have depth greater than 3, and the maximum depth is 7 or 10. Fourth, the size distribution of the trees is skewed, following our empirical law 1. As a final step, we merge the generated mesh and the generated forest to obtain the P2P network topology. We can further use our laws 3, 4, 5, and 6 to examine the quality of the generated topologies. By fine-tuning the parameters, we can obtain specific topologies that meet our needs.

9. Conclusion and future work

In this paper, we study the hybrid P2P network topology from the mesh perspective and the forest perspective, respectively. Using the two-layered approach and the laws proposed, realistic topologies can be generated.

References

[1] C. Xie, Y. Pan, Analysis of large-scale hybrid peer-to-peer network topology, in: Proc. IEEE GLOBECOM’06, San Francisco, USA, 2006.
[2] M.A. Jovanovic, Modelling large-scale peer-to-peer networks and a case study of gnutella, Master’s thesis, University of Cincinnati, Cambridge, June 2000.
[3] Gnutella, The gnutella protocol v0.6, 2002.
[4] The KaZaA website, 2006.
[5] Clip2, The Gnutella protocol specification v0.4, 2001.
[6] The Limewire website, 2006.
[7] L.A. Adamic, R.M. Lukose, A.R. Puniyani, B.A. Huberman, Search in power-law networks, Physical Review E 64 (2001) 46135–46143.
[8] M. Faloutsos, P. Faloutsos, C. Faloutsos, On power-law relationships of the internet topology, in: Proc. ACM SIGCOMM’99, New York, NY, 1999, pp. 251–262.
[9] D. Magoni, J.-J. Pansiot, Analysis of the autonomous system network topology, ACM SIGCOMM Computer Communication Review 31 (3) (2001) 26–37.
[10] M. Ripeanu, I. Foster, A. Iamnitchi, Mapping the Gnutella network: properties of large-scale peer-to-peer systems and implications for system design, IEEE Internet Computing Journal 6 (1) (2002) 50–57.
[11] H. Chen, H. Jin, J. Sun, D. Deng, X. Liao, Analysis of large-scale topological properties for peer-to-peer networks, in: Proc. IEEE CCGrid’04, 2004, pp. 27–34.
[12] Q. He, M. Ammar, G. Riley, H. Raj, R. Fujimoto, Mapping peer behavior to packet-level details: a framework for packet-level simulation of peer-to-peer systems, in: Proc. IEEE/ACM MASCOTS’03, Orlando, FL, October 2003.
[13] S. Merugu, S. Srinivasan, E. Zegura, P-sim, A simulator for peer-to-peer networks, in: Proc. IEEE/ACM MASCOTS’03, Orlando, FL, October 2003.


[14] N.S. Ting, R. Deters, 3LS – A peer-to-peer network simulator, in: Proc. IEEE P2P’03, Sweden, 2003.
[15] N. Kotilainen, M. Vapa, T. Keltanen, A. Auvinen, J. Vuori, P2PRealm – Peer-to-Peer Network Simulator, in: Proc. 11th International Workshop on Computer-Aided Modeling, Analysis and Design of Communication Links and Networks, 2006, pp. 93–99.
[16] M. Jelasity, A. Montresor, G.P. Jesi, Peersim peer-to-peer simulator, 2004, Available from: .
[17] W. Yang, N. Abu-Ghazaleh, GPS: a general peer-to-peer simulator and its use for modeling BitTorrent, in: Proc. IEEE/ACM MASCOTS’05, Atlanta, GA, 2005.
[18] A.L. Barabasi, R. Albert, Emergence of scaling in random networks, Science 286 (1999) 509.
[19] A. Medina, I. Matta, J. Byers, On the origin of power laws in internet topologies, ACM SIGCOMM Computer Communication Review 30 (2) (2000) 18–28.
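
As an illustrative aside, the topology-generation guidelines listed just before the conclusion (a small mesh, a large forest, a skewed mesh degree distribution, shallow trees, and a final merge) can be sketched in Python. This is only a minimal sketch under stated assumptions, not the procedure used in the paper. The function name and all parameters except the 15.4% mesh share and the maximum depth of 7 are hypothetical. Preferential attachment in the style of [18] stands in for power-laws 1 and 2, whose exponents are not restated here. Trees are assumed to hang off mesh nodes, and `p_attach_to_mesh` is an untuned knob rather than a fit to empirical law 1.

```python
import random


def generate_two_layer_topology(n_nodes, mesh_fraction=0.154,
                                links_per_mesh_node=3,
                                p_attach_to_mesh=0.8, max_depth=7, seed=0):
    """Return (edges, depth) for a mesh-plus-forest overlay sketch.

    depth[v] is 0 for mesh nodes and the tree level (>= 1) for forest nodes.
    All default values other than mesh_fraction and max_depth are assumptions.
    """
    rng = random.Random(seed)
    k = links_per_mesh_node
    n_mesh = max(k + 1, round(n_nodes * mesh_fraction))
    edges = set()
    depth = {}

    # Mesh layer: start from a small clique, then attach each new mesh node
    # to k distinct existing mesh nodes chosen in proportion to their degree
    # (preferential attachment, used here as a stand-in for a skewed
    # degree distribution).
    endpoints = []  # every node appears once per incident mesh edge
    for i in range(k + 1):
        depth[i] = 0
        for j in range(i):
            edges.add((j, i))
            endpoints += [j, i]
    for v in range(k + 1, n_mesh):
        targets = set()
        while len(targets) < k:
            targets.add(rng.choice(endpoints))
        for t in targets:
            edges.add((t, v))
            endpoints += [t, v]
        depth[v] = 0

    # Forest layer: every forest node gets exactly one parent, either a mesh
    # node (tree level 1) or an earlier forest node that is not yet at
    # max_depth. Adding these edges merges the forest into the mesh.
    attachable = []  # forest nodes that may still accept children
    for v in range(n_mesh, n_nodes):
        if not attachable or rng.random() < p_attach_to_mesh:
            parent = rng.randrange(n_mesh)
        else:
            parent = rng.choice(attachable)
        edges.add((parent, v))
        depth[v] = depth[parent] + 1
        if depth[v] < max_depth:
            attachable.append(v)

    return edges, depth


if __name__ == "__main__":
    edges, depth = generate_two_layer_topology(1000)
    n_mesh = sum(1 for d in depth.values() if d == 0)
    print("mesh nodes:", n_mesh, "forest nodes:", len(depth) - n_mesh)
    print("edges:", len(edges), "deepest level:", max(depth.values()))
```

A generated topology could then be checked against the stated guidelines, for example by measuring the share of shallow trees and adjusting `p_attach_to_mesh` until the depth statistics fall in the reported ranges.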

Chao Xie is currently a Ph.D. student in the Department of Computer Science at the University of Wisconsin-Madison. He obtained his M.S. degree in Computer Science from Georgia State University, USA, in 2007, his M.Eng. degree in Computer Science from Huazhong University of Science and Technology, China, in 2005, and his B.S. degree in Mechanical Engineering from Huazhong University of Science and Technology, China, in 2001. His main research interests include computer networks, distributed systems, parallel computing and data mining. Chao Xie is a member of the Association for Computing Machinery and the IEEE Computer Society.

Guihai Chen obtained his B.S. degree from Nanjing University, M.Eng. from Southeast University, and Ph.D. from the University of Hong Kong. He visited Kyushu Institute of Technology, Japan in 1998 as a research fellow, and the University of Queensland, Australia in 2000 as a visiting professor. From September 2001 to August 2003, he was a visiting professor at Wayne State University. He is now a full professor and deputy chair of the Department of Computer Science, Nanjing University. Prof. Chen has published more than 100 papers in peer-reviewed journals and refereed conference proceedings in the areas of wireless sensor networks, high-performance computer architecture, peer-to-peer computing and performance evaluation. He has also served on technical program committees of numerous international conferences. He is a member of the IEEE Computer Society.

Art Vandenberg was born in Grasonville, Maryland, 1950. Education includes B.A. English Literature, Swarthmore College, Swarthmore, PA, 1972; M.V.A Painting and Drawing, Georgia State University, Atlanta, GA 1979; and M.S. Information and Computer Systems, Georgia Institute of Technology, Atlanta, GA 1985. He has worked in library systems, research and administrative computing since 1976, including 15 years in information technology positions at Georgia Institute of Technology. Since 1997 he has been with Information Systems & Technology at Georgia State University, as Director of Advanced Campus Services charged with deploying middleware and research computing infrastructure. His current activities include deploying grid computing solutions and establishing high-performance computing cyberinfrastructure. Recent research grants include an NSF ITR Award 0312636 as Co-PI investigating a unique approach to resolving metadata heterogeneity for information integration by combining monitoring, clustering and visualization to discover patterns or trends. He is a member of Georgia State’s IT Risk Management Research Group, the Georgia State Information Integration Lab, and serves as Chair of SURAgrid, a regional grid initiative of the Southeastern Universities Research Association. Mr. Vandenberg is a member of the Association for Computing Machinery and the IEEE Computer Society.

Yi Pan is the chair and a professor in the Department of Computer Science and a professor in the Department of Computer Information Systems at Georgia State University. Dr. Pan received his B.Eng. and M.Eng. degrees in computer engineering from Tsinghua University, China, in 1982 and 1984, respectively, and his Ph.D. degree in computer science from the University of Pittsburgh, USA, in 1991. Dr. Pan’s research interests include parallel and distributed computing, optical networks, wireless networks, and bioinformatics. Dr. Pan has published more than 100 journal papers with 30 papers published in various IEEE journals. In addition, he has published over 100 papers in refereed conferences (including IPDPS, ICPP, ICDCS, INFOCOM, and GLOBECOM). He has also co-authored/co-edited 30 books (including proceedings) and contributed several book chapters. His pioneering work on computing using reconfigurable optical buses has inspired extensive subsequent work by many researchers, and his research results have been cited by more than 100 researchers worldwide in books, theses, journal and conference papers. He is a co-inventor of three U.S. patents (pending) and 5 provisional patents, and has received many awards from agencies such as NSF, AFOSR, JSPS, IISF and Mellon Foundation. His recent research has been supported by NSF, NIH, NSFC, AFOSR, AFRL, JSPS, IISF and the states of Georgia and Ohio. He has served as a reviewer/panelist for many research foundations/agencies such as the U.S. National Science Foundation, the Natural Sciences and Engineering Research Council of Canada, the Australian Research Council, and the Hong Kong Research Grants Council. Dr. Pan has served as an editor-in-chief or editorial board member for 15 journals including 5 IEEE Transactions and a guest editor for 10 special issues for 9 journals including 2 IEEE Transactions. He has organized several international conferences and workshops and has also served as a program committee member for several major international conferences such as INFOCOM, GLOBECOM, ICC, IPDPS, and ICPP. Dr. Pan has delivered over 10 keynote speeches at many international conferences. Dr. Pan is an IEEE Distinguished Speaker (2000-2002), a Yamacraw Distinguished Speaker (2002), a Shell Oil Colloquium Speaker (2002), and a senior member of IEEE. He is listed in Men of Achievement, Who’s Who in Midwest, Who’s Who in America, Who’s Who in American Education, Who’s Who in Computational Science and Engineering, and Who’s Who of Asian Americans.