Random walk search in unstructured P2P


Journal of Systems Engineering and Electronics, Vol. 17, No. 3, 2006, pp. 648-653

Jia Zhaoqing, You Jinyuan, Rao Ruonan & Li Minglu

1. Dept. of Computer Science and Engineering, Shanghai Jiaotong Univ., Shanghai 200030, P. R. China; 2. Dept. of Foundation Science, the First Aeronautical Inst. of the Air Force, Xinyang 464000, P. R. China (Received March 11, 2005)

Abstract: Unstructured P2P has a power-law link distribution, and the random walk in power-law networks is analyzed. The analysis results show that the probability that a random walker walks through the high degree nodes is high in a power-law network, and that the information on the high degree nodes can be easily found through random walk. A random walk spread and random walk search method (RWSS) is proposed based on the analysis result. Simulation results show that RWSS achieves high success rates at low cost and is robust to high degree node failure.

Keywords: unstructured P2P search, random walk search, random walk spread, power-law network

1. INTRODUCTION

In the last few years, unstructured P2P applications such as Gnutella [1] and Kazaa [2] have become very popular. They are designed for sharing files among the peers in the networks. There is no precise control over the network topology or file placement in these systems. In general, they employ a flooding scheme for searching objects and waste a lot of bandwidth [3]. Today, bandwidth consumption attributed to these applications amounts to a considerable fraction (up to 60%) of the total Internet traffic [3]. It is of great importance to reduce their total traffic for the user and the broad Internet community.

A search for a file in a P2P network is successful if it discovers the location of the file. The ratio of successful to total searches made is the success rate of the algorithm. The performance of an algorithm is associated with its success rate, while its cost relates to the number of messages it produces.

Many search methods for unstructured P2P have been proposed with the aim of reducing the overhead of the original Gnutella flooding mechanism. Reference [4] proposed a variation of flooding search. In this method, nodes randomly choose a ratio of their neighbors to forward the query to. It certainly reduces the average message production compared to flooding search, but it still contacts a large number of nodes. In Ref. [5], the authors proposed a random walk search method. The requesting node sends out several query messages to an equal number of randomly chosen neighbors. Each query message is forwarded to a randomly chosen neighbor node at each step by intermediate nodes. In this algorithm, total traffic is largely reduced, but its success rate is very low. Reference [6] proposed a search algorithm which utilizes high degree nodes. Each node indexes all files on its neighbors, and when a node forwards a query message, it chooses the highest degree neighbor node to forward it to. In this algorithm, the probability that the files on the low degree nodes cannot be found is high, so the success rate is still low.

Unstructured P2P networks display a power-law distribution in their node degrees [7], so we analyze the random walk in power-law networks in this paper. We present a random walk spread mechanism to improve search in unstructured P2P and propose the random walk spread and search (RWSS) algorithm. We perform extensive simulations and compare RWSS with the method in Ref. [6]; RWSS achieves good results in success rates, message production and average hops.

2. RANDOM WALK IN POWER-LAW NETWORK

Random walk is a well-known technique, which forwards a query message to a randomly chosen neighbor at each step until the object is found. This message is known as a "random walker". In a power-law graph, $p(k)$ is the probability that a randomly chosen node has degree $k$, and is given by $p(k) = k^{-r} / \sum_k k^{-r}$. A random edge arrives at a node with probability proportional to the degree of the node, i.e.,

$$p_1(k) = \frac{k\,p(k)}{\sum_k k\,p(k)},$$

where $\sum_k k\,p(k)$ is equal to the average degree $\langle k \rangle$.

For a sufficiently high degree $k$, there is only one node with that degree in the graph, and its $p(k) = 1/N$, so $p_1(k) = k / (N \langle k \rangle)$. Let the maximum degree [8] be $m = N^{1/r}$; a random edge then arrives at the highest degree node with probability $p_1(m) = 1 / (N^{(r-1)/r} \langle k \rangle)$. For simplicity and relevance to most real-world networks of interest we assume $2 < r < 3$.

So, the probability that a random walker walks through the highest degree node within $n$ steps is given by Eq. (1).
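The quantities defined in this section are easy to evaluate numerically. The following is a minimal sketch, not the authors' code: the function names are hypothetical, and the n-step formula $1 - (1 - p_1(m))^n$ is an assumption (it treats every step as an independent trial that lands on the highest degree node with probability $p_1(m)$), which may differ in form from the paper's Eq. (1).

```python
def max_degree(n_nodes: int, r: float) -> float:
    """Maximum degree m = N^(1/r) assumed in the analysis."""
    return n_nodes ** (1.0 / r)


def edge_hits_max_degree(n_nodes: int, r: float, k_avg: float) -> float:
    """p_1(m) = 1 / (N^((r-1)/r) * <k>): a random edge ends at the top node."""
    return 1.0 / (n_nodes ** ((r - 1.0) / r) * k_avg)


def walker_hits_max_degree(steps: int, n_nodes: int, r: float, k_avg: float) -> float:
    """ASSUMPTION: independent steps, so P(n) = 1 - (1 - p_1(m))^n."""
    p = edge_hits_max_degree(n_nodes, r, k_avg)
    return 1.0 - (1.0 - p) ** steps


if __name__ == "__main__":
    # N^(1/r) for a 20 000 node graph with r = 2.1 and r = 2.3.
    print(round(max_degree(20000, 2.1), 1), round(max_degree(20000, 2.3), 1))
```

Under this assumption the probability grows monotonically with the number of steps, and a smaller exponent r (hence a larger $N^{1/r}$) makes the highest degree node easier to reach.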

Figure 1 displays the probabilities that a random walker walks through the highest degree node in a 100 000 node graph for several values of the exponent. The probability is over 70% at n = 5 000, and compared with N = 100 000, 5 000 is very low. The difference between the maximum degree and the degree of every other high degree node is small, thus the probability that a random walker walks through any other high degree node is close to that of the highest degree node. In real-world graphs one does observe nodes of degree higher than $N^{1/r}$, so that random walk will arrive at high degree nodes more quickly.

Fig. 1 Probability vs. steps of random walker, for r (top to bottom) = 2.1, 2.3, 2.5, 2.7, 2.9

Figure 2 displays a comparison between simulation results and theory values on two power-law graphs: graph I20 000, which is produced by Inet-3.0 [9], and graph P20 000, which is produced by Pajek [10]. Graph I20 000 has 20 000 nodes with $m = 3\,159$, $\langle k \rangle = 5.26$ and $r = 2.3$. Graph P20 000 has 20 000 nodes with $m = 441$, $\langle k \rangle = 5.84$ and $r = 2.1$. The theory values are given by Eq. (1). The simulation results are gained by simulating random walk on the two graphs. Because the maximum degree of each graph is much higher than $N^{1/r}$ ($20\,000^{1/2.1} = 111.7$ and $20\,000^{1/2.3} = 74.1$), the simulation result is much better than the theory value. The simulation results show that random walk can arrive at the highest degree nodes more quickly in real-world graphs.

Fig. 2 (x-axis: step, 0 to 1 000; legend: SR on P20 000, SR on I20 000, TV on P20 000, TV on I20 000)

3. RWSS ALGORITHM

Based on the analysis of Section 2, random walks in power-law networks naturally gravitate towards the high degree nodes. If file location information is placed on the high degree nodes, then it will be easily found through random walk. Hence, we design a replica spread method employing random walk for placing file information on the high degree nodes in a P2P network, and present the RWSS algorithm. RWSS includes two phases: random walk spread and random walk search. To reduce delay in RWSS, a node sends b messages, and each message takes its own random walk. The simulations in Ref. [4] confirm that b walkers after T steps reach roughly the same number of nodes as 1 walker after bT steps. To describe our algorithm easily, we give the


following definitions.

Definition 1 LocalDirectory (LocalDir): Each node has a LocalDirectory that points to the shared files on this node.

Definition 2 FileInformationDatabase (FileInfoDB) is a database. The location information of the files managed by other nodes can be stored in this database. These files are remote ones on the other nodes. An entry of the FileInfoDB database is a pair (filename, loc). Each node has a FileInfoDB.

Definition 3 FileInformation(messageId, filename, loc, ttl) is a message. It is sent by a node to spread loc, the location information of the file filename. messageId is used to identify a unique message in the network. The message can be forwarded for at most ttl steps.

Definition 4 QueryRequest(messageId, filename, ttl) is a query message sent by a node to query the file filename. The message can be propagated to at most ttl nodes.

Definition 5 Response(messageId, filename, loc) is a reply message. If the location information of the target file filename is found on a certain node, then a reply message is returned to the requesting node. Field loc indicates the address of the file filename.

Definition 6 Neighbors(s) is a node set, which is formed by the neighbors of node s.

In the spread mechanism, when a node wants to share its files, it sends out b FileInformation messages to an equal number of randomly chosen neighbors. Each of these messages follows its own path, having intermediate nodes forward it to a randomly chosen node at each step. Module 1 displays the spread mechanism.

Module 1 Spread Mechanism (Node sId spreads the location information of the file filename)

Spread Procedure on Source Node sId:
    Add filename to LocalDir;
    fi = New FileInformation(fmId, filename, loc, ln(N));
    For j = 1 to b {
        Randomly choose a node ni, ni ∈ Neighbors(sId);
        Forward fi to node ni;
    }

Spread Procedure on Intermediate Node ni:
    Receive FileInformation(fmId, filename, loc, ttl) from node n1;
    If filename not in LocalDir
        If filename not in FileInfoDB
            Add (filename, loc) to FileInfoDB;
    If ttl > 0 {
        Randomly choose a node n2, n2 ∈ Neighbors(ni);
        Forward FileInformation(fmId, filename, loc, ttl - 1) to n2;
    }
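The spread procedure can be sketched in a few lines of Python. This is an illustrative, single-process simulation rather than the authors' implementation: the `Node` class, the `link` and `spread` helpers, and the choice to round ln(N) down to an integer TTL are all assumptions made for the example, and message identifiers are omitted because nothing in the sketch needs them.

```python
import math
import random


class Node:
    """A peer with the per-node state named in Definitions 1, 2 and 6."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.neighbors = []     # Neighbors(s)
        self.local_dir = set()  # LocalDir: files shared by this node
        self.file_info_db = {}  # FileInfoDB: filename -> loc of a remote copy


def link(a, b):
    """Add an undirected edge between two nodes."""
    a.neighbors.append(b)
    b.neighbors.append(a)


def spread(source, filename, n_nodes, b, rng):
    """Send b FileInformation walkers from source, each with TTL ~ ln(N)."""
    source.local_dir.add(filename)
    ttl0 = int(math.log(n_nodes))  # assumption: ln(N) rounded down
    for _ in range(b):
        node, ttl = rng.choice(source.neighbors), ttl0
        while True:
            # Intermediate node: record the location unless it is already known.
            if filename not in node.local_dir and filename not in node.file_info_db:
                node.file_info_db[filename] = source.node_id
            if ttl <= 0:
                break
            node, ttl = rng.choice(node.neighbors), ttl - 1
```

On a star-shaped graph every walker's first hop is the hub, so the hub always ends up holding the location entry. This is a small-scale picture of how high degree nodes collect replicas.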

The search mechanism (Module 2) is similar to the spread mechanism. The only difference between them is that file information is forwarded in the spread mechanism and a query message is forwarded in the search mechanism.

Module 2 Search Mechanism (Node rId locates file filename)

Query Procedure on Request Node rId:
    If filename not in FileInfoDB
        For i = 1 to c {
            Randomly choose a node nj, nj ∈ Neighbors(rId);
            Forward QueryRequest(qmId, filename, ln(N)) to node nj;
        }
    Else
        Return Response(rmId, filename, loc);

Query Procedure on Intermediate Node nj:
    Receive QueryRequest(qmId, filename, ttl) from node n1;
    If filename in LocalDir
        Return Response(rmId, filename, nj) to node rId;
    If filename in FileInfoDB
        Return Response(rmId, filename, loc) to node rId;
    If ttl > 0 {
        Randomly choose a node n2, n2 ∈ Neighbors(nj);
        Forward QueryRequest(qmId, filename, ttl - 1) to node n2;
    }


The time-to-live (TTL) value of the query message is set at ln(N), and the number of messages per request is O(b ln(N)).
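The search phase can be simulated in the same spirit. The sketch below is illustrative and makes several assumptions not stated in the paper: the graph is a plain adjacency dictionary, `rw_search` is a hypothetical helper name, ln(N) is rounded down to an integer TTL, and a walker stops as soon as it finds the file (Module 2 as printed does not say whether forwarding continues after a hit).

```python
import math
import random


def rw_search(adj, local_dir, file_info_db, requester, filename, walkers, rng):
    """Return (loc, hops, messages); loc and hops are None if no walker succeeds."""
    known = file_info_db[requester].get(filename)
    if known is not None:
        return known, 0, 0  # answered from the requester's own FileInfoDB
    ttl0 = int(math.log(len(adj)))  # assumption: ln(N) rounded down
    messages, best = 0, None
    for _ in range(walkers):
        node, ttl, hops = rng.choice(adj[requester]), ttl0, 1
        messages += 1
        while True:
            # A node answers from LocalDir first, then from FileInfoDB.
            if filename in local_dir[node]:
                loc = node
            else:
                loc = file_info_db[node].get(filename)
            if loc is not None:
                if best is None or hops < best[1]:
                    best = (loc, hops)
                break  # assumption: a successful walker stops
            if ttl <= 0:
                break
            node, ttl, hops = rng.choice(adj[node]), ttl - 1, hops + 1
            messages += 1
    if best is None:
        return None, None, messages
    return best[0], best[1], messages
```

Each walker sends at most ttl + 1 messages, so the total per request is bounded by the number of walkers times roughly ln(N), in line with the O(b ln(N)) cost stated above.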

4. SIMULATIONS

Table 1 The first group of graphs (produced by Inet-3.0)

Graph name        I5 000    I10 000   I15 000   I20 000
Graph size        5 000     10 000    15 000    20 000
Average degree    3.48      4.12      4.69      5.26
Maximum degree    1 026     1 800     2 500     3 159

Table 2 The second group of graphs (produced by Pajek)

Graph name        P5 000    P10 000   P15 000   P20 000
Graph size        5 000     10 000    15 000    20 000
Average degree    5.60      5.83      5.83      5.84
Maximum degree    165       312       314       441

4.1 Simulation Results of Random Walk Spread

Because the average degree of each graph is close to 6, 6 random walkers are used to spread file information. Figure 3 displays the replica distribution on graph I20 000 and graph P20 000 after the location information of 1 000 files has been spread over the two graphs through 6 walkers with steps = 10. The figure shows that the high degree nodes of each graph have more information copies, and they form a "directory server group". Even though the two graphs have different exponents and different node degrees, the power-law link distribution gives them similar replica distributions. Because the degrees of the high degree nodes of graph I20 000 are much higher than those of graph P20 000, the probability that a random walk arrives at a high degree node on graph I20 000 is higher than that on graph P20 000, and there are more information copies on its high degree nodes.

Fig. 3 Replica distribution (panel (b): replica distribution on graph P20 000; x-axis: node degree)

4.2 Performance of RWSS

We focus on three measures: success rate, messages per request and average hops. Average hops is the average number of hops per successful request. Usually, 16 to 64 walkers give good results. Hence, 32 random walkers are employed for searching objects in the simulation. Table 3 displays the simulation results and shows that RWSS achieves high success rates on the two groups of graphs. Each request produces over two hundred messages on the first group of graphs and over three hundred messages on the second group of graphs. In contrast to the flooding mechanism, the query cost is very low. The average hops are about 2 on the first group of graphs and less than 5 on the second group of graphs. Because the degrees of the high degree nodes of the first group of graphs are higher than those of the second group, random walk can arrive at them more quickly in the first group of graphs. Hence, the performance of RWSS on the first group of graphs is better than that on the second group of graphs.
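The three measures can be computed from a log of simulated requests as follows. This is a small sketch: the record layout `(found, messages, hops)` and the function name `summarize` are assumptions chosen for illustration, not part of the paper.

```python
def summarize(requests):
    """Compute (success rate in %, messages per request, average hops).

    requests: list of (found, messages, hops) tuples, one per search request.
    Average hops is taken over successful requests only, as defined above.
    """
    total = len(requests)
    successes = [r for r in requests if r[0]]
    success_rate = 100.0 * len(successes) / total
    messages_per_request = sum(r[1] for r in requests) / total
    if successes:
        average_hops = sum(r[2] for r in successes) / len(successes)
    else:
        average_hops = float("nan")  # undefined when nothing was found
    return success_rate, messages_per_request, average_hops
```

Failed requests still count toward messages per request, since their walkers consume bandwidth until the TTL expires.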


Table 3 Simulation results of RWSS

First group of graphs
Measures                I5 000    I10 000   I15 000   I20 000
Success rates/%         99.97     99.98     10[0]     100
Messages per request    239.09    240.08    232.21    221.66
Average hops            2.03      1.96      1.86      1.76

Second group of graphs
Measures                P5 000    P10 000   P15 000   P20 000
Success rates/%         99.08     95.03     88.13     81.26
Messages per request    315.56    330.45    338.10    341.48
Average hops            3.14      3.93      4.54      4.85

4.3 Load Balance

The ratio of the requests processed by a node to the total requests made is the load rate of the node. Table 4 displays the load rates of the ten highest degree nodes of each graph and shows that, compared with graph I20 000, the loads of the high degree nodes of graph P20 000 are largely reduced. Because the degrees of the high degree nodes of graph P20 000 are lower than those of graph I20 000, the probability that a query message arrives at a high degree node on graph P20 000 is lower than that on graph I20 000. Hence, a better load balance among the high degree nodes can be achieved by tuning the degrees of the high degree nodes. But reducing the degrees of the high degree nodes lowers the success rate. There is a tradeoff between increasing the success rate and reducing the loads of the high degree nodes.
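The load rate defined above can be measured from simulation traces as in the following sketch. The trace format (one list of visited node ids per request) and the name `load_rates` are assumptions for illustration; the sketch also assumes a node is counted once per request even when several walkers of the same request pass through it.

```python
from collections import Counter


def load_rates(visited_per_request):
    """Return {node: load rate in %} from per-request lists of visited nodes."""
    total = len(visited_per_request)
    counts = Counter()
    for visited in visited_per_request:
        counts.update(set(visited))  # count each node at most once per request
    return {node: 100.0 * c / total for node, c in counts.items()}
```

Sorting the result by node degree and taking the top ten entries would give a table of the same shape as Table 4.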

Table 4 Load rates of the ten highest degree nodes (%)

Nodes    Graph I20 000    Graph P20 000
1st      88.3             50.4
2nd      87.6             43.3
3rd      82.9             35.4
4th      77.5             34.3
5th      76.2             32.9
6th      73.4             29.1
7th      68.9             28.4
8th      66.5             27.7
9th      64.9             27.5
10th     62.3             24.9

4.4 Robustness to High Degree Node Failure

Usually, the high degree nodes with heavy load are prone to failure. On the one hand, we can prevent the high degree nodes from failing by enforcing that only high capacity nodes become highly connected [11]. On the other hand, our algorithm is robust to high degree node failure, because many nodes hold file location information. Figure 4 displays the success rates as zero to ten of the highest degree nodes fail, and shows that when the ten highest degree nodes have failed, the success rate is still 81.47% on graph I20 000 and 70.1% on graph P20 000. As the number of failed high degree nodes increases, the success rates decrease very slowly.

Fig. 4 Success rates vs. number of failed high degree nodes

4.5 Comparison with Ref. [6] Algorithm

The Ref. [6] algorithm utilizes the power-law characteristic of unstructured P2P. In the Ref. [6] algorithm, only one high degree walker is employed for searching the target object, so that, compared with the flooding method and the Ref. [5] algorithm, message production is effectively reduced. At the same time, the Ref. [6] algorithm adopts a replication strategy, so the success rate is improved. The RWSS algorithm also utilizes the power-law characteristic of unstructured P2P and adopts a replication strategy, so it is similar to the Ref. [6] algorithm. Therefore, we compare RWSS with the Ref. [6] algorithm. Table 5 lists the simulation results of the Ref. [6] algorithm. Comparing Table 3 with Table 5, the success rates of RWSS, especially on the second group of graphs, are higher than those of the Ref. [6] algorithm, and the average hops are much lower.

In the Ref. [6] algorithm, each node indexes all files on its neighbors, and a node chooses the highest degree neighbor node to forward the query to. The high degree nodes have no information about the files on the many low degree nodes that are not their neighbors. The probability that the files on the low degree nodes cannot be found is high, so the success rate is low. Because the degrees of the high degree nodes of the second group of graphs are very low and they have few neighbor nodes, the success rates on the second group of graphs are very low.

Table 5 Simulation results of Ref. [6] algorithm (search walker with steps = 450)

First group of graphs
Measures                I5 000    I10 000   I15 000   I20 000
Success rates/%         91.0      88.2      85.7      84.2
Messages per request    111.4     151.5     143       155.6
Average hops            63.3      84.8      58.5      44

Second group of graphs
Measures                P5 000    P10 000   P15 000   P20 000
Success rates/%         39.8      31        22.1      23.5
Messages per request    313.0     342.2     368.9     373.1
Average hops            109       103.5     83.0      122.9

5. CONCLUSION

In this paper, we analyze the random walk in power-law networks and point out that random walk can arrive at the high degree nodes very quickly. This paper describes a random walk spread mechanism for improving search in unstructured P2P. By spreading file location information, the high degree nodes in unstructured P2P form a "directory server group", so the files can be easily located through random walk search. We propose the RWSS search method, and extensive simulations show that it achieves high performance at low cost. Finally, compared with the Ref. [6] algorithm, RWSS exhibits higher success rates for similar message consumption.

REFERENCES

[1] Gnutella website: http://gnutella.wego.com.
[2] Kazaa website: http://www.kazaa.com.
[3] Sandvine Inc. An industry white paper: the impact of file sharing on service provider networks. 2002.
[4] Vana Kalogeraki, Dimitrios Gunopulos, Zeinalipour-Yazti D. A local search mechanism for peer-to-peer networks. ACM 11th International Conference on Information and Knowledge Management, McLean, VA, 2002: 300-307.
[5] Qin Lv, Pei Cao, Edith Cohen, et al. Search and replication in unstructured peer-to-peer networks. 16th ACM International Conference on Supercomputing, New York, 2002: 84-95.
[6] Adamic Lada A, Lukose Rajan M, Puniyani Amit R, et al. Search in power-law networks. Physical Review E, 2001, 64: 046135.
[7] Jovanovic Mihajlo A. Modeling large-scale peer-to-peer networks and a case study of Gnutella. Master Thesis. 2001.
[8] William Aiello, Fan Chung, Lu Linyuan. A random graph model for massive graphs. Proc. of the Thirty-Second Annual ACM Symposium on Theory of Computing, Portland, Oregon, USA, 2000: 171-180.
[9] Cheng Jin, Qian Chen, Sugih Jamin. Inet: Internet topology generator. Technical Report CSE-TR-443-00, Department of EECS, University of Michigan, 2000.
[10] Pajek website: http://vlado.fmf.uni-lj.si/pub/networks/pajek/.
[11] Yatin Chawathe, Sylvia Ratnasamy, Lee Breslau. Making Gnutella-like P2P systems scalable. Proc. of the ACM SIGCOMM 2003 Conf. on Applications, Technologies, Architectures, and Protocols for Computer Communication, Karlsruhe, Germany, 2003: 407-418.

Jia Zhaoqing was born in 1971. He is a Ph. D. candidate. His research interests include distributed computing, computer networks, and complex networks. E-mail: jiazhaoqing@hotmail.com

You Jinyuan is a Professor and a Ph. D. advisor. His research interests include distributed system and computing, object-oriented technique, and software engineering.

Rao Ruonan is a Vice Professor. His research interests include distributed computing and software engineering.

Li Minglu is a Professor and a Ph. D. advisor. His research interests include web services and grid computing, multimedia computing and biomedicine informatics.