Emerging structures of P2P networks induced by social relationships


Available online at www.sciencedirect.com

Computer Communications 31 (2008) 620–628 www.elsevier.com/locate/comcom

Emerging structures of P2P networks induced by social relationships Vincenza Carchiolo, Michele Malgeri, Giuseppe Mangioni, Vincenzo Nicosia

*

Dipartimento di Ingegneria Informatica e delle Telecomunicazioni, Facoltà di Ingegneria – Università di Catania, Viale A. Doria 6, 95100 Catania, Italy

Available online 21 August 2007

Abstract

Networks seem to be the natural way chosen by nature to organise individuals, resources and interactions into an effective and robust structure. Studies of natural networks have focused on the central role of emerging structures in distributed environments, and have pointed out properties, such as the small-world effect and communities, which are of the utmost importance in guaranteeing fast and efficient communication among nodes. In this paper we propose a model for P2P networks which mimics the behaviour of peers in social and biological networks and naturally evolves into a robust graph of peers with some interesting properties, including the small-world effect and community decomposition.

© 2007 Elsevier B.V. All rights reserved.

Keywords: Complex networks; Peer-to-peer; Communities; Distributed management; Emerging structures

1. Introduction

A social network may be thought of as a network where nodes are human beings and links are built whenever two of them interact. In the real world it is possible to observe several kinds of social networks, such as acquaintance and collaboration networks (see [12] for an overview). Studies of social networks have also been applied to non-social networks such as technological networks (e.g. the Internet and the World-Wide Web [2]) and biological networks (e.g. neural networks [23], food webs [8,24]). An interesting aspect of some of these networks is that they show the “small-world effect”, which consists of a small average distance between vertices in the network, regardless of the total number of nodes involved. The small-world effect was discovered by Milgram in 1966 [18] and has triggered a large amount of research activity among scientists of different disciplines. In social networks it is common that two of your friends have a greater probability of knowing each other

* Corresponding author. E-mail addresses: [email protected] (V. Carchiolo), [email protected]. it (M. Malgeri), [email protected] (G. Mangioni), vnicosia@diit.unict.it (V. Nicosia). 0140-3664/$ - see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.comcom.2007.08.016

than two people chosen at random from the population, on account of their common acquaintance with you. In general, clustering is a property that many networks have in common: it expresses the fact that two vertices that are both neighbours of the same third vertex have a heightened probability of also being neighbours of one another. This effect is quantified by the clustering coefficient [10]. Its value is 1 for fully connected networks, and it typically lies in the range 0.1–0.5 in many real-world networks. Studies have highlighted that a network presents a “small-world effect” if it has a small diameter and a large clustering coefficient. Many real networks also share the scale-free property [12]. In particular, if we define the degree of a vertex as the number of other vertices it is connected to, it has been discovered that there are typically many vertices with low degree and a small number with high degree. This leads to a vertex degree distribution that often follows a power law. A property which usually appears in social and relationship networks is community structure. A society is usually divided into many groups, and the community structure reflects the self-organisation of a network, where people having certain characteristics in common are more likely to be acquainted. Examples include communities of

V. Carchiolo et al. / Computer Communications 31 (2008) 620–628

jazz musicians [3], communities of scientists working in similar areas of research [12], communities of papers on a single topic in a citation network [6], communities of Web pages on related topics, etc. There is no unique definition of a community, but the general idea is that members of a community have more connections among themselves than with members of other communities. Recently Newman [9] defined a measure, called modularity, that can be used to numerically evaluate the community structure of a network. Modularity is currently adopted as a quality measure in many community detection algorithms. The ability to detect community structure in a network can help to understand and exploit the network more effectively. Many interesting models have been proposed in the last 10 years to solve the problem of obtaining networks with given characteristics [2,12]. Those models, mainly based on generating functions, implicitly assume as targets some network properties, such as node degrees, linking probability and assortativity, and build a network which possesses the requested properties at a global scale. But in real networks [5], such as citation and co-authorship networks or many social networks, nobody supervises the entire process of adding and removing nodes, adding and removing links, raising or lowering weights on links and so on. There is no generating function which obliges a given node to build a link to another one, or which creates or removes nodes to obtain a suitable network layout. All control and management of the network structure is implicit in the local choices made by peers, and is therefore completely distributed and unsupervised. Each peer chooses its neighbours and links to them looking only at its own goals and perspectives, without regard to any predetermined or desired network structure or feature.
Nevertheless, those networks reveal many interesting emerging properties that naturally arise from the self-organisation of the peers participating in the network. Peers in a co-authorship network, for example, do not know anything about the small-world distribution of degrees or about the high clustering coefficient of the network they belong to. The network has not been “programmed” to become a small world or to guarantee a small average path length between peers: those effects are just consequences of the simple local behaviour of each peer, combined with the local behaviour of all other peers in the network. This work proposes a model for the growth and evolution of a peer-to-peer network inspired by the dynamics of social networks. We think that the emerging structural and topological properties of natural networks described above can be successfully exploited to improve communication among peers and to retrieve shared resources efficiently. The model is called PROSA (P2P Resource Organisation by Social Acquaintances), and has been successfully used to build efficient P2P networks for document sharing [7,21,20,19]. PROSA uses the Vector Space Model (VSM) [16] for resource organisation and retrieval: the same approach


has been chosen by other recently proposed P2P architectures, such as SETS [4] and GES [26], since it allows users to issue Google-like queries while aiding peers in selecting the best path along which to forward queries. The result is a kind of “semantic-driven” routing strategy, where queries are always forwarded to peers that can probably answer them. In Section 2 we discuss related contributions in the field of “semantic-driven” P2P networks; Section 3 explains how PROSA works and discusses the algorithms involved; Section 4 summarises simulation results for PROSA, with respect to topological structure and resource retrieval; in Section 5 we discuss the strength of the communities of peers arising in PROSA, GES and SETS; Section 6 reports conclusions and a brief outline of future developments.

2. Related works

Since the first appearance of P2P overlay networks for wide-range applications, many efforts have been made to obtain efficient structures, precise resource retrieval algorithms and optimised routing protocols. Years have passed since the days of peer-to-peer overlays based on central indexes, such as Napster, where the search strategy was a simple yellow-pages-like system with links to the nodes hosting the required resources. P2P networks have evolved a lot, exploiting the power of unstructured decentralised architectures (such as Gnutella [27]) or the efficiency of Distributed Hash Tables (DHT) (such as Chord [17] and Tapestry [25]). Nevertheless, a trade-off between message routing efficiency and usability of the search mechanism is required: queries in unstructured systems can be expressed in a human-like way, but the underlying search strategy is often neither efficient nor precise. On the other hand, resource retrieval in pure DHT systems is really fast, but queries are usually represented by the result of a hashing function, losing any reference to semantics.
In the last few years, different approaches have been proposed to guarantee both usability and efficiency of P2P structures. Some efforts have been devoted to using the Vector Space Model (VSM) [16] to represent resources, and this could be a winning strategy: VSM allows users to express queries in a natural way, while peers can still use the information carried by State Vectors (SV) to make routing choices. VSM associates a vector with both documents and queries; those vectors contain the TF–IDF coefficients of the terms contained in the document or, in the case of a non-textual resource, of the terms describing the resource itself. In this direction, SETS [4] is a proposal for a P2P structure where peers are divided into “segments” depending on the kind of shared resources. A single super-peer acting as network manager is in charge of assigning peers to the right segment: nodes in a segment have a given number of so-called “short-distance links” to other nodes in the same segment, and a relatively low number of “long-distance links” to nodes in other segments. Any issued query is forwarded by nodes to a number of relevant segments, and then flooded to all peers of each segment. Peers have to periodically ask the network manager which segment they should belong to, and migration of peers from one segment to another is relatively costly. This structure has many drawbacks: there is a single point of failure (the network manager); flooding, albeit limited to segments, is not the best way to search for documents, at least with respect to the number of messages exchanged to answer each query; and network segmentation and maintenance are left to the network manager, leading to an almost static and scarcely adaptive topology.

Another attempt to exploit VSM for P2P overlay organisation is made in GES [26], which uses only decentralised algorithms for query routing. Peers exchange information with their neighbours about the kind of resources they are sharing; when a query is issued, each peer makes a routing decision looking only at the best-matching neighbour (i.e., the one which could probably answer the query). The net result is that queries are usually forwarded along a path of best-matching peers until they reach a peer which is able to answer the query with a relevant resource, called the “target node”. From that point on, the query is flooded to the target node's neighbourhood, within a predefined radius. Drawbacks of GES include the relatively high number of hops usually required to reach a target node and the significant overhead of the link management procedures that peers periodically perform in order to link themselves with similar nodes. Both SETS and GES are artificial systems: the algorithms used in those networks have been chosen to obtain a given network topology and given features, and no relevant structure emerges naturally in such networks.
Using the same model for knowledge representation (VSM) and starting from observations of real social networks, PROSA is instead able to build efficient P2P structures where queries can be answered quickly and efficiently and where similar peers naturally cluster together, forming communities. In the following we report comparisons among PROSA, GES and SETS.

3. PROSA network

The most important consideration underlying PROSA is that the global properties of many social networks are due to the local behaviour of peers. Our assumption is that it is possible to obtain similar emerging global effects in a synthetic P2P structure if the local behaviour of peers is similar to the real behaviour of people in a social network. For this reason PROSA is heavily inspired by networks of relationships among people, since they have been deeply studied in the past twenty years by sociologists and mathematicians, and because such networks usually have some interesting properties: (i) small average path length among peers, (ii) high clustering coefficient, and (iii) strong community structure. Peers and links in PROSA evolve as people and relationships among people do in a social network: each peer links to a certain number of other peers, forwards

queries to them and, occasionally, meets new peers and builds links to them. Peers send messages only to relatives or neighbours; messages to unknown peers are forwarded along paths that can probably reach them, looking only at local connections and at the knowledge associated with them. The result is a network with completely decentralised and distributed management, which has many structural properties of real-life social networks.

3.1. PROSA model

A directed graph is the natural mathematical abstraction of a network made of a set of vertices and a set of links among vertices. More formally, we can model PROSA as a directed graph:

$$\mathrm{PROSA} = (P, L, P_r, \mathrm{Label}) \qquad (1)$$

P denotes the set of vertices. Each vertex (in the following also called node or peer) in the network is represented by an element $i \in P$, while each $l \in L$, $l = (i, j)$, represents a directed link from node i to node j, with $i, j \in P$. Note that links in PROSA are directed because social relationships are usually not symmetric: a peer could be known by many other peers without knowing them in return, as in the case of pop stars or leaders of large movements or political parties. Representing a network as a graph is not sufficient to model its dynamic behaviour: a socially inspired network also needs models for resources, peers and links, and an algorithm for link management and query routing. Informally, $P_r$ is a function which associates resources with a peer, and Label is a function used to label edges. More details are given in the following sections.

3.2. Resources and queries representation

Each node in a social network has a certain amount of “knowledge”. Most of the evolution of social networks is strictly related to communicative actions performed by peers in order to obtain help, collaboration, information or other kinds of resources from other peers. For this reason we need to define a model for both knowledge and query representation in PROSA. Note that the behaviour of PROSA and the algorithms governing its dynamics are independent of the particular models adopted. For this reason, in this section we introduce a very general model for the knowledge and queries used in PROSA. We suppose that all resources owned by peers belong to an appropriate Resource Space R, which is a metric space where a distance function $Rel(r_1, r_2)$ is defined $\forall r_1, r_2 \in R$. Queries for resources in a PROSA network are terms $q \in Q$, where Q is the so-called Query Space. We also define a function $R_{r,q} : R \times Q \to \mathbb{R}$, $\forall r \in R$, $\forall q \in Q$, which associates a “relevance value” with a pair (r, q), that value being a measure of how relevant a given resource r is to the query q.


As stated above, the definition of PROSA is independent of the knowledge and query models adopted, but a concrete model is needed in order to actually implement the network. In the simulations presented later on we adopted, without loss of generality, the VSM [16] for modelling the documents hosted by each peer and the queries issued during document search. VSM represents a document using a state vector of (stemmed) terms called the Document Vector (DV) (in this case the Resource Space is $R \subseteq \mathbb{R}^n$, where n is the size of the DV). Each term t in the vector is assigned a weight $w_{t,d}$ based on the relevance of the term inside the document d. This weight is computed using a modified version of the TF–IDF scheme [15]. We model a query by means of a so-called Query Vector (QV), which is the VSM representation of the query itself. In this case the Query Space is $Q \subseteq \mathbb{R}^n$, where n is the size of the QV. Since both resources and queries are represented by state vectors, we define the relevance of a resource r with respect to a query q as follows:

$$R_{r,q}(r, q) = \sum_{t \in r \cap q} w_{t,r} \cdot w_{t,q} \qquad (2)$$
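As an illustration of Eq. (2), the following minimal sketch computes the relevance of a document to a query when both are stored as sparse term-to-weight dictionaries. The dictionary representation, the function name `relevance` and the sample weights are assumptions made for this example; the weights are taken as given and are not computed with the modified TF–IDF scheme mentioned above.

```python
from typing import Dict

# Sparse VSM vector: term -> weight (weights assumed to be precomputed).
Vector = Dict[str, float]


def relevance(resource: Vector, query: Vector) -> float:
    """Eq. (2): sum of w_{t,r} * w_{t,q} over the terms shared by r and q."""
    shared_terms = resource.keys() & query.keys()
    return sum(resource[t] * query[t] for t in shared_terms)


# Hypothetical document and query vectors.
doc = {"graph": 0.5, "peer": 0.25, "social": 0.75}
query = {"peer": 1.0, "social": 0.5, "routing": 1.0}

# Only "peer" and "social" are shared: 0.25 * 1.0 + 0.75 * 0.5
score = relevance(doc, query)
```

Terms that appear in only one of the two vectors contribute nothing, so a document sharing no terms with the query has relevance zero.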

3.3. Peers

Each peer participates in a social network by sharing resources, and in real life knowledge about the resources shared by other peers is used to forward queries. For example, if a student needs information about the Riemann Hypothesis, he will not ask a person at random, since most people are not involved in advanced mathematics. On the other hand, he could ask his maths teacher and would probably obtain an answer, since every maths teacher should know something about Riemann, or should be able to ask another maths professor for information about the Riemann Hypothesis. Starting from this observation, we allow peers in PROSA to be represented by the resources they share, so that this information can be used effectively during query routing, as in real life. Since peers share a certain amount of resources, we define a function $P_r : P \to 2^R$, which associates each peer with its resources. From the set of resources of a peer we obtain a short description of the peer's knowledge using a function $P_c : P \to C$, where C is a generic space on which a sum operation (+) is defined. Pc can be considered as a “sort of sum” of all the resources of a given peer; it is used during query routing, as explained in the following. In order to allow peers to choose the best neighbour for query forwarding, we define a function $R_{p,q} : C \times Q \to \mathbb{R}$ which computes the relevance of a peer's knowledge with respect to a given query or, in other words, an estimate of how closely the peer's knowledge matches the topic of the query. In particular, for a given peer p, Pc(p) is calculated and this value is used to estimate the relevance with respect to the query q. A high relevance of a peer with respect to a query indicates that


the peer could probably provide resources matching the query. Note that the implementation of the relevance function depends on the concrete model adopted for both the Resource Space and the Query Space. As said above, we model both resources and queries using VSM. In this case, we adopt the following definition of relevance:

$$R_{p,q}(P_c(p), q) = \sum_{t \in P_c(p) \cap q} w_{t,P_c(p)} \cdot w_{t,q} \qquad (3)$$
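The following sketch illustrates how a compact peer description and Eq. (3) could be used to pick the most promising neighbour. It assumes that the “sort of sum” Pc is a plain term-wise sum of the peer's document vectors, which is only one possible choice; the function names and the sample data are hypothetical.

```python
from typing import Dict, Iterable, Optional

Vector = Dict[str, float]  # sparse VSM vector: term -> weight


def peer_description(resources: Iterable[Vector]) -> Vector:
    """One possible P_c: term-wise sum of all document vectors of a peer."""
    total: Vector = {}
    for dv in resources:
        for term, weight in dv.items():
            total[term] = total.get(term, 0.0) + weight
    return total


def peer_relevance(pc: Vector, query: Vector) -> float:
    """Eq. (3): sum of w_{t,Pc(p)} * w_{t,q} over the shared terms."""
    return sum(pc[t] * query[t] for t in pc.keys() & query.keys())


def best_neighbour(descriptions: Dict[str, Vector], query: Vector) -> Optional[str]:
    """Name of the neighbour whose description is most relevant to the query."""
    if not descriptions:
        return None
    return max(descriptions, key=lambda name: peer_relevance(descriptions[name], query))


# Hypothetical peers, echoing the maths-teacher example of Section 3.3.
teacher = peer_description([{"math": 0.5, "riemann": 0.5}, {"math": 0.25}])
musician = peer_description([{"jazz": 0.5, "music": 1.0}])
query = {"riemann": 1.0, "math": 0.5}

chosen = best_neighbour({"teacher": teacher, "musician": musician}, query)
```

Here the teacher's description overlaps with the query on both terms while the musician's does not overlap at all, so the teacher is selected as the next hop.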

3.4. Links

Links in PROSA are directed edges l(i, j) connecting a source peer i with a destination peer j. Types and strengths of links among peers in PROSA evolve as in real social networks, and they are modelled with three different types of links: acquaintance, fully semantic and temporary semantic links. When a peer meets another peer without knowing anything about its knowledge, interests or skills, it considers it a simple “acquaintance”. This kind of link is really weak, does not require any information exchange among peers, and is called an “Acquaintance Link” (AL) in PROSA. It is similar to the link between a baby and his parents in the first months of his life: he does not know anything about his parents' knowledge, culture, jobs, interests, etc. In real life, such weak links are used when a peer does not have links to peers which are relevant to the query, i.e. to peers that can probably answer it. Using an AL is similar to asking people at random when looking for information that nobody we know could directly or indirectly provide. The strongest link available in PROSA is the so-called “Fully Semantic Link” (FSL). When somebody gets information from, or collaborates with, a group of people, he usually knows what kind of knowledge, interests, culture and skills each colleague has. Having this information is of the utmost importance in order to route a question “to the right person”, as usually done in real life. We suppose that FSLs in PROSA are a consequence of resource sharing among peers, as in real life. A third kind of link is introduced in PROSA to model an acquaintance stronger than an AL and weaker than an FSL, and is called a “Temporary Semantic Link” (TSL). A TSL appears when a peer has only partial information about another peer's knowledge, culture or interests.
Each link in PROSA is associated with a compact representation of the knowledge of the target peer, when available. In the case of ALs, no such information is available (modelled with an empty set ∅). For TSLs, the contents of past queries originated by the target peer are used as a compact description of its knowledge (TPc), under the assumption that if a peer is searching for a given kind of resource it will eventually find and share it, so that queries in that field could be forwarded to it in the future. In the case of FSLs, the associated knowledge is the compact description of the target peer's knowledge (Pc), i.e. an element of the space C.


The different meanings of links are modelled by means of a labelling function Label: for a given link $l = (i, j) \in L$, Label(l) is a vector of two elements [e, w]. The former is the link label ($e \in \{AL, TSL, FSL\}$) and the latter is a weight used to model what the source peer knows about the target peer; this is computed as follows:

• if e = AL ⇒ w = ∅;
• if e = TSL ⇒ w = TPc(l);
• if e = FSL ⇒ w = Pc(l).

See [20,19] for a deeper description.

3.5. Network management and query routing

To provide an effective model for social networks, management in PROSA should be completely distributed and unsupervised, as observed in many real networks. For this reason, the query routing mechanism is based only on local choices, and contributes heavily to network management and organisation, since new links among peers arise as a consequence of query forwarding and answering. When a peer joins PROSA for the first time, it acquires a small number of peers and establishes ALs to them. These links are ALs because a new peer does not know anything about its neighbours until it asks them for resources. When the newcomer has to perform or forward a query for resources and it has only ALs to other peers, one of them is selected at random and used as the next hop. In order to show how PROSA works, we need to define the structure of a query message. Each query message is a quadruple QM = (qid, q, s, nr), where qid is a unique query identifier which ensures that a peer does not respond to a query more than once; q is the query, expressed according to the knowledge model in use (if knowledge is modelled by the Vector Space Model, for example, q is a state vector of stemmed terms); $s \in P$ is the source peer; and nr is the number of required results. The dynamic behaviour of PROSA is modelled by Algorithm 1 and is strictly related to queries. When a user of PROSA asks for a resource on a peer sq, the user builds a query q and specifies the number of results nr he wants to obtain.
This is equivalent to calling ExecQuery(PROSA, sq, qm) (where qm = (qid, q, sq, nr)).

Algorithm 1. ExecQuery: query q originating from peer sq executed on peer cur

Require: PROSA(P, L, Pr, Label), cur ∈ P, qm = (qid, q, sq, nr) ∈ QM
 1: Result ← ∅
 2: if cur ≠ sq then
 3:   UpdateLink(PROSA, cur, sq, q)
 4: end if
 5: (Result, numRes) ← ResourcesRelevance(PROSA, q, cur, Threshold1)
 6: if numRes == 0 then
 7:   f ← SelectNextPeer(PROSA, cur, q)
 8:   if f ≠ null then
 9:     ExecQuery(PROSA, f, qm)
10:   end if
11: else
12:   SendMessage(sq, qid, cur, Result)
13:   L ← L ∪ {(sq, cur)}
14:   Label(sq, cur) ← [FSL, Pc(cur)]
15:   if numRes < nr then
16:     for all t ∈ Neighborhood(cur) do
17:       rel ← Rp,q(Pc(t), q)
18:       if rel > Threshold2 then
19:         qm ← (qid, q, sq, nr − numRes)
20:         ExecQuery(PROSA, t, qm)
21:       end if
22:     end for
23:   end if
24: end if
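The control flow of Algorithm 1 can be sketched in Python as follows. This is a simplified, single-process illustration rather than the authors' implementation, and it makes several assumptions that are not in the original text: vectors are sparse dictionaries, the “+” on compact descriptions is a term-wise sum, an existing AL to the source peer is upgraded to a TSL like an unknown peer, a `visited` set stands in for the qid check and also prevents forwarding loops, the two threshold values are arbitrary, no link from a peer to itself is created, and `answers` replaces SendMessage.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple

Vector = Dict[str, float]  # sparse VSM vector: term -> weight


def dot(a: Vector, b: Vector) -> float:
    """Relevance as in Eqs. (2) and (3): weights multiplied on shared terms."""
    return sum(a[t] * b[t] for t in a.keys() & b.keys())


def add(a: Vector, b: Vector) -> Vector:
    """Term-wise sum, assumed here as the '+' on compact descriptions."""
    out = dict(a)
    for term, w in b.items():
        out[term] = out.get(term, 0.0) + w
    return out


@dataclass
class Peer:
    docs: List[Vector]
    # target peer name -> (label, weight); label is "AL", "TSL" or "FSL"
    links: Dict[str, Tuple[str, Vector]] = field(default_factory=dict)

    def pc(self) -> Vector:
        total: Vector = {}
        for d in self.docs:
            total = add(total, d)
        return total


@dataclass
class Network:
    peers: Dict[str, Peer]
    threshold1: float = 0.5  # hypothetical value for Threshold1
    threshold2: float = 0.5  # hypothetical value for Threshold2
    rng: random.Random = field(default_factory=lambda: random.Random(0))
    answers: List[Tuple[str, Vector]] = field(default_factory=list)


def update_link(cur: Peer, sq: str, q: Vector) -> None:
    label, weight = cur.links.get(sq, ("AL", {}))
    if label == "FSL":
        return  # an FSL needs no update
    # Unknown peer or AL: new TSL built from q; existing TSL: q is folded in.
    cur.links[sq] = ("TSL", add(weight, q))


def select_next_peer(net: Network, cur: Peer, q: Vector,
                     visited: Set[str]) -> Optional[str]:
    open_links = {t: lw for t, lw in cur.links.items() if t not in visited}
    semantic = {t: dot(w, q) for t, (label, w) in open_links.items() if label != "AL"}
    if semantic:  # best-matching TSL/FSL neighbour
        return max(semantic, key=lambda t: semantic[t])
    # Only ALs are left: pick one at random.
    return net.rng.choice(sorted(open_links)) if open_links else None


def exec_query(net: Network, cur_name: str, q: Vector, sq: str, nr: int,
               visited: Optional[Set[str]] = None) -> None:
    visited = set() if visited is None else visited
    visited.add(cur_name)
    cur = net.peers[cur_name]
    if cur_name != sq:                                          # lines 2-4
        update_link(cur, sq, q)
    hits = [d for d in cur.docs if dot(d, q) > net.threshold1]  # line 5
    if not hits:                                                # lines 6-10
        nxt = select_next_peer(net, cur, q, visited)
        if nxt is not None:
            exec_query(net, nxt, q, sq, nr, visited)
        return
    net.answers.extend((cur_name, d) for d in hits)             # line 12
    if cur_name != sq:                                          # lines 13-14
        net.peers[sq].links[cur_name] = ("FSL", cur.pc())
    if len(hits) < nr:                                          # lines 15-23
        for t in list(cur.links):
            if t not in visited and dot(net.peers[t].pc(), q) > net.threshold2:
                exec_query(net, t, q, sq, nr - len(hits), visited)


# Hypothetical three-peer network: a knows b only as an acquaintance.
net = Network(peers={
    "a": Peer(docs=[{"jazz": 1.0}], links={"b": ("AL", {})}),
    "b": Peer(docs=[{"cooking": 1.0}],
              links={"c": ("FSL", {"math": 1.0, "riemann": 1.0})}),
    "c": Peer(docs=[{"math": 1.0, "riemann": 1.0}]),
})
exec_query(net, "a", {"riemann": 1.0}, sq="a", nr=1)
```

In this run the query travels from a to b over the AL, b forwards it to c over its FSL, and c answers. As side effects, b and c each gain a TSL to a describing what a searched for, and a gains an FSL to c.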

The first time ExecQuery is called, cur is equal to sq and this avoids the execution of instruction #3. Subsequent calls of ExecQuery, i.e. when a peer receives a query forwarded by another peer, use the function UpdateLink, which updates the link between the current peer cur and sq as follows. If sq is an unknown peer for cur, a new TSL to that peer is added, having as weight a TPc based on the received query message. Note that a TPc can be considered as a “good hint” for the current peer, in order to gain links to other remote peers. It is quite probable that the query will finally be answered by some other peer and that the requesting peer will eventually download some of the resources that matched it. It is therefore useful to record a link to that peer, in case that kind of resource is requested in the future by other peers. If cur has a TSL with sq, the corresponding TPc is updated. Finally, if cur has an FSL with sq, no updates are necessary. After link updates, the relevance of the query with respect to the resources hosted by cur is evaluated by calling the function ResourcesRelevance. In particular, this function selects all the resources hosted by cur whose relevance with respect to q is greater than a given threshold. If we use VSM as the model for resources and queries, the relevance is calculated using Eq. (2). Two cases are possible:

• If none of the hosted resources has sufficient relevance, the query has to be forwarded to another peer f, called the “forwarder”. Subsequent forwards are modelled by ExecQuery, where f becomes the current peer. The peer f is selected among the neighbours by SelectNextPeer, using the following procedure:
  - The relevance between the query q and the neighbours of cur with TSL or FSL links is computed. If, for example, we use VSM as the model for resources and queries, such relevance is calculated using Eq. (3).
  - The peer connected by the link having the highest relevance is selected as the next peer.
  - If the peer cur has neither FSLs nor TSLs, i.e. it has just ALs, the next peer is selected at random.
• If the peer hosts resources with sufficient relevance with respect to q, two sub-cases are possible:
  - The peer has enough relevant documents to fulfil the request. In this case a result message is sent to sq and the query is not forwarded any further.
  - The peer has a certain number of relevant documents, but they are not enough to fulfil the request (i.e. they are

in the network. We estimate the APL of PROSA by measuring the average length of the path followed by all queries. Several definitions of the Clustering Coefficient exist in the literature. We used the definition given in [23], where the clustering coefficient of a node is defined as:

$$CC_n = \frac{E_{n,real}}{E_{n,tot}} \qquad (4)$$

where n's neighbours are all the peers to which n is linked, $E_{n,real}$ is the number of links between n's neighbours and $E_{n,tot}$ is the maximum number of possible edges between n's neighbours. Note that if a peer k is in the neighbourhood of n, the converse is not guaranteed, because links in PROSA are directed. The clustering coefficient of the whole network is defined as:

$$CC = \frac{1}{|V|} \sum_{n \in V} CC_n \qquad (5)$$

Fig. 1 shows a comparison of the APL of PROSA with the APL of several real networks (see [2]). Moreover, in Fig. 1 we have included the average path length APLrnd and the Clustering Coefficient CCrnd of a random graph (i.e. a graph where two nodes are linked together with a given probability p) with the same size and average degree. See [2] for the definitions of APLrnd and CCrnd. As shown in Fig. 1, a GES network of 400 nodes reveals an APL and a CC that are very similar to those of the corresponding random graph. In other words, from the point of view of structural properties GES seems to behave like a random graph. If we look only at structural properties, SETS displays a behaviour very similar to a small world, having a small APL and a CC about five times greater than that of the corresponding random graph. Note that this kind of behaviour is induced by the network management algorithm, which is based on the presence of a super-peer responsible for clustering nodes into topic segments; this can partially explain the value of CC, which cannot be considered an emerging property of the network. Looking at the results, it becomes clear that PROSA evolves into a small world, as happens for many real networks. Note that the clustering coefficient of a PROSA network is ten times higher than that of a corresponding random graph, while the average path length is just a bit longer than in

Fig. 1. APL and CC of PROSA compared to those of several real networks.


a random graph, in accordance with results observed in other real graphs that are small worlds. The most interesting feature of PROSA is that those values of the topological parameters have been obtained using the same dynamics involved in social structures, without using a complex analytical method to generate a network with those desirable properties. PROSA seems to capture at least part of the natural way of building complex and efficient networks.

4.2. Information retrieval in PROSA

The most widely used quality measure for P2P data retrieval systems is the so-called Recall, defined as the fraction of matching documents retrieved by the system with respect to the total number of matching documents. Simulations made on PROSA networks of different sizes [19] show that more than 50% of queries have a recall higher than 50%, while only 10% of queries have a comparable recall if a simple random-walk strategy is used on the same network. Similar simulations made on GES reveal that at least 70% of queries are answered with a recall higher than 50%, while SETS can guarantee a recall higher than 50% for more than 60% of queries. Looking only at recall, the performance of PROSA seems really poor compared with similar algorithms based on VSM. Nevertheless, we should also take into account the “processing cost”, i.e. the fraction of nodes in the network which are actually visited to answer each query. For example, a simple flooding strategy is able to obtain a recall of 100% for 99% of queries,¹ but it requires visiting almost all nodes, with a corresponding processing cost of around 100%. Comparing PROSA with GES and SETS while also looking at processing cost leads to striking results: 50% of queries in PROSA have a recall higher than 50% with a processing cost of just 4%, meaning that similarity-based routing is able to find nodes which can answer a given query quickly and precisely.
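The two measures used in this comparison can be written down directly. The sketch below is only an illustration of the definitions given above: the function names, the set-based representation and the sample identifiers are assumptions, and returning 0.0 when no document matches is a convention chosen here, not one taken from the paper.

```python
from typing import Set


def recall(retrieved: Set[str], matching: Set[str]) -> float:
    """Fraction of the matching documents that were actually retrieved."""
    if not matching:
        return 0.0  # convention assumed for this sketch
    return len(retrieved & matching) / len(matching)


def processing_cost(visited: Set[str], network_size: int) -> float:
    """Fraction of the nodes in the network visited to answer one query."""
    return len(visited) / network_size


# Hypothetical query: 4 documents match, 2 of them are found,
# and 4 peers out of 100 are visited along the way.
matching = {"d1", "d2", "d3", "d4"}
retrieved = {"d1", "d2", "d9"}
visited = {"p1", "p7", "p9", "p12"}

query_recall = recall(retrieved, matching)
query_cost = processing_cost(visited, 100)
```

With these sample values the query reaches a recall of 50% at a processing cost of 4%, the kind of operating point reported above for PROSA.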
On the other hand, queries obtaining a recall higher than 50% in GES require a processing cost higher than 50%, i.e. half the nodes have to be visited to obtain a useful recall, resembling the poor performance observed with flooding. Similar considerations hold for SETS, where the measured processing cost for queries answered with a recall higher than 50% is even higher than 60%: a large number of query messages are exchanged to answer queries, and a high number of nodes are visited by every single query. We can conclude that PROSA performs better than similar P2P systems based on VSM, since a relatively high recall is obtained with a really low processing cost. This is a desirable feature in real P2P applications, because the main drawback of current networks is the relatively high waste of bandwidth due to inefficient query routing strategies.

¹ Remember that graphs of P2P networks are directed, and some queries could not obtain a recall of 100% because of unconnected or “blind” components.

5. Communities in PROSA

As reported in the introduction to this work, real networks are usually divided into communities of peers, where the main property of a community is that the peers belonging to it usually have a greater number of links to other peers in the same community than expected for a random graph [9,2]. In this section we show that PROSA is able to reproduce such a structure, and that nodes in a PROSA network are usually organised in communities of similar peers, based on the resources they share.

5.1. Finding community structure

The problem of finding community structure in networks is completely different from the well-known and deeply studied “graph clustering” problem, where the goal is to find a cut of a connected graph that minimises the number of links between the two sub-graphs. It is important to clarify what we refer to when we talk of a “community structure”, in order to avoid confusion. As defined in [14], a community of a graph is a sub-graph where links among the nodes belonging to it are more numerous than expected in a random graph or, equivalently, where links among nodes of the sub-graph are more numerous than links to nodes outside the sub-graph. As a result, a community is a heavily connected component of a graph, and this property can be exploited to optimise resource sharing and retrieval. Many algorithms have been proposed in the last four years to discover communities in a network. Some of these algorithms use agglomerative techniques, where peers that are similar with respect to some topological properties are iteratively linked together in the same community [9]. Other algorithms, instead, proceed in the opposite direction, removing links between nodes and iteratively separating communities of similar nodes [13].
A third family of algorithms uses spectral analysis and optimisation, adapting well-known techniques used in classical clustering problems, such as spectral bisection and multidimensional spectral analysis [14]. A few recent proposals use soft-computing techniques, such as genetic algorithms, to discover communities. Many recently proposed algorithms are based on the optimisation of “modularity”. Given a decomposition of a graph into C communities, the modularity can be expressed as:

$$Q = \sum_{i=1}^{C} \left( e_{ii} - a_i^2 \right) \qquad (6)$$

where $e_{ii}$ is the fraction of edges that fall within community $i$, while $a_i$ is the fraction of edge ends attached to nodes in community $i$, so that $a_i^2$ is the fraction of edges that would fall within community $i$ if edges were placed at random, i.e. if no community structure existed [9]. The modularity of a graph decomposition is therefore a quality measure of the decomposition itself, with higher modularity indicating a stronger community structure. In this section we report results obtained by running Newman's spectral optimisation algorithm for community detection, in order to reveal the community structure of a typical PROSA network and to compare it with GES and SETS. Graphs are obtained by simulating an application of PROSA to document sharing [19]. Nodes in the networks are sharply divided into two classes, depending on the kind of documents they share. After running the simulation for a sufficient amount of time, typically the time of 10 to 15 queries for each node, networks show a strong decomposition into two different communities, reflecting the original differences in the documents shared by peers. The same data-set has also been used to grow SETS and GES networks.
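As an illustration of the quantities involved, the following sketch computes the modularity of Eq. (6) for a given decomposition and bisects a small graph by the signs of the leading eigenvector of a modularity matrix. It is not the code used for the simulations: the function names are hypothetical, the graph is treated as undirected, the null model $P_{ij} = k_i k_j / 2m$ (with $k_i$ the degree of node $i$ and $m$ the number of edges) is an assumption made here, and a plain power iteration stands in for a proper eigen-solver.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Eq. (6): Q = sum_i (e_ii - a_i^2) for an undirected edge list."""
    m = len(edges)
    inside = defaultdict(int)  # edges with both ends in community i
    ends = defaultdict(int)    # edge ends attached to community i
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        ends[cu] += 1
        ends[cv] += 1
        if cu == cv:
            inside[cu] += 1
    return sum(inside[c] / m - (ends[c] / (2 * m)) ** 2 for c in ends)


def leading_eigenvector_split(n, edges, iterations=500):
    """Bisect nodes 0..n-1 by the signs of the leading eigenvector of B.

    Assumes an undirected graph with at least one edge.
    """
    degree = [0] * n
    adj = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        adj[u][v] = adj[v][u] = 1.0
        degree[u] += 1
        degree[v] += 1
    two_m = 2 * len(edges)
    # Modularity matrix under the assumed null model P_ij = k_i * k_j / 2m.
    b = [[adj[i][j] - degree[i] * degree[j] / two_m for j in range(n)]
         for i in range(n)]
    # Shift by a bound on the spectral radius so that power iteration
    # converges to the most positive eigenvalue, not the largest in magnitude.
    shift = max(sum(abs(x) for x in row) for row in b)
    vec = [float(i + 1) for i in range(n)]  # arbitrary non-uniform start
    for _ in range(iterations):
        nxt = [sum(b[i][j] * vec[j] for j in range(n)) + shift * vec[i]
               for i in range(n)]
        norm = sum(x * x for x in nxt) ** 0.5
        vec = [x / norm for x in nxt]
    # The sign of each component decides the community of the node.
    return {i: (0 if vec[i] >= 0 else 1) for i in range(n)}


# Toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
groups = leading_eigenvector_split(6, edges)
q = modularity(edges, groups)
```

On this toy graph the split separates the two triangles and the modularity of the resulting decomposition is 5/14, roughly 0.36. The dense matrix and the fixed number of iterations keep the sketch short; they are only adequate for very small graphs.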

5.2. Results

An analytical way of finding a decomposition of a graph which maximises modularity has been proposed in [14]. This algorithm works on a so-called “modularity matrix”, obtained by a simple reformulation of the definition of modularity given in (6):

$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - P_{ij} \right] \delta(g_i, g_j) \qquad (7)$$

where $m$ is the total number of links, $A_{ij}$ are the elements of the adjacency matrix, $P_{ij}$ is the probability that a link exists between nodes $i$ and $j$ in a corresponding random graph and $\delta(g_i, g_j)$ is equal to 1 only if $i$ and $j$ belong to the same community. The algorithm consists of finding the most positive eigenvalue $\lambda_{max}$ of the modularity matrix, whose elements are $A_{ij} - P_{ij}$, and dividing the graph into two communities according to the signs of the components of the eigenvector $v_{max}$ corresponding to $\lambda_{max}$. As reported in [9], values of modularity higher than 0.3 reveal a strong community structure. The algorithm finds a decomposition of a PROSA network with 200 or 400 nodes into two communities, with a modularity value in the range [0.35, 0.45] depending on network size. This means that nodes in PROSA are naturally structured into communities, and this effect is obtained as a (desirable) side-effect of the mechanisms used for query routing and link management. The same algorithm, run on GES, is able to find a network decomposition into two communities, but with a modularity of just 0.14, while SETS has a similar decomposition with a higher modularity value (0.39). The strong community decomposition in SETS can be explained by the implemented network management algorithm: as previously said, nodes are grouped into segments depending on similarities in the resources they actually share, and usually have a higher number of links to peers in the same cluster than to peers outside it. This induces a modular structure and leads to an artificial community creation which corresponds exactly to topic segments.

6. Conclusions

This work proposes a model of real networks, called PROSA, inspired by social dynamics and behaviours. This model is able to capture many desirable characteristics of real-world networks, such as the small-world effect and community structure, while keeping the mechanisms used for communication among peers and for network management simple and intuitive. PROSA has been used as an effective framework to develop P2P networks for resource sharing, obtaining an efficiency comparable to that of real social networks. As explained, the mechanisms used for link management and query routing induce a strong community structure, which can explain why searching and retrieval of resources in PROSA is so efficient and fast. Future work includes applying PROSA in different fields, from trust management to predictive studies of the future evolution of networks under external stimuli.

References

[2] R. Albert, A.-L. Barabási, Statistical mechanics of complex networks, Reviews of Modern Physics 74 (2002) 47.
[3] A. Arenas, L. Danon, A. Diaz-Guilera, P. Gleiser, R. Guimera, Community analysis in social networks, European Physical Journal B 38 (2) (2004) 373–380.
[4] M. Bawa, G.S. Manku, P. Raghavan, SETS: search enhanced by topic segmentation, in: SIGIR'03: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, ACM Press, 2003, pp. 306–313. ISBN 1-58113-646-3.
[5] M. Buchanan, Nexus: Small Worlds and the Groundbreaking Theory of Networks, first ed., W.W. Norton & Company, 2002. ISBN 9780393041538.
[6] J. Hopcroft, O. Khan, B. Kulis, B. Selman, Tracking evolving communities in large networks, Proceedings of the National Academy of Sciences USA 101 (2004) 5249–5253.
[7] V. Carchiolo, M. Malgeri, G. Mangioni, V. Nicosia, Social behaviours applied to P2P systems: an efficient algorithm for resources organisation, in: Second International Workshop on Collaborative P2P Information Systems, COPS 2006, Manchester, 2006.
[8] J.M. Montoya, R.V. Sole, Small world patterns in food webs, cond-mat/0011195, 2000.
[9] M.E.J. Newman, M. Girvan, Finding and evaluating community structure in networks, Physical Review E 69 (2004) 026113.
[10] M.E.J. Newman, S.H. Strogatz, D.J. Watts, Random graphs with arbitrary degree distributions and their applications, Physical Review E 64 (2001) 026118.
[12] M.E.J. Newman, The structure and function of complex networks, SIAM Review 45 (2003) 167.
[13] M.E.J. Newman, Fast algorithm for detecting community structure in networks, Physical Review E 69 (2004) 066133.
[14] M.E.J. Newman, Modularity and community structure in networks, Proceedings of the National Academy of Sciences USA 103 (2006) 8577.
[15] G. Salton, C. Buckley, Term weighting approaches in automatic text retrieval, Technical Report, Ithaca, NY, USA, 1987.
[16] H. Schutze, C. Silverstein, A comparison of projections for efficient document clustering, in: Proceedings of ACM SIGIR, Philadelphia, PA, July 1997, pp. 74–81.
[17] I. Stoica, R. Morris, D. Karger, F. Kaashoek, H. Balakrishnan, Chord: a scalable peer-to-peer lookup service for internet applications, in: Proceedings of the 2001 ACM SIGCOMM Conference, 2001, pp. 149–160.
[18] S. Milgram, The small world problem, Psychology Today 2 (1967) 60–67.
[19] G. Mangioni, V. Carchiolo, M. Malgeri, V. Nicosia, Efficient searching and retrieval of documents in PROSA, in: Databases, Information Systems and Peer-To-Peer Computing 2006 – DBISP2P06, 2006.
[20] G. Mangioni, V. Carchiolo, M. Malgeri, V. Nicosia, Evaluating the dynamic behaviour of PROSA P2P network, in: International Symposium on Parallel and Distributed Processing and Applications 2006, ISPA06, 2006.
[21] G. Mangioni, V. Carchiolo, M. Malgeri, V. Nicosia, Self-organisation of resources in PROSA P2P network, in: Self-Managed Networks, Systems, and Services – Proceedings of Second IEEE International Workshop, SelfMan 2006, Dublin, number 3996 in LNCS, 2006, pp. 172–174.
[23] D.J. Watts, S.H. Strogatz, Collective dynamics of 'small-world' networks, Nature 393 (1998) 440–442.
[24] R.J. Williams, E.L. Berlow, J.A. Dunne, A.-L. Barabási, N.D. Martinez, Two degrees of separation in complex food webs, Proceedings of the National Academy of Sciences 99 (2002) 12913–12916.
[25] B.Y. Zhao, J.D. Kubiatowicz, A.D. Joseph, Tapestry: an infrastructure for fault-tolerant wide-area location and routing, Technical Report UCB/CSD-01-1141, UC Berkeley, April 2001.
[26] Y. Zhu, X. Yang, Y. Hu, Making search efficient on Gnutella-like P2P systems, in: Parallel and Distributed Processing Symposium, Proceedings of the 19th IEEE International, IEEE Computer Society, April 2005, pp. 56a–56a.
[27] Gnutella website. World Wide Web. Available from: .

Michele Malgeri is Associate Professor in the Department of Informatics and Telecommunications at the University of Catania. His research interests include distributed systems, information retrieval, query languages and formal languages. He received a degree with Honours in Electrical Engineering from the University of Catania, Italy, in 1983.

Vincenza Carchiolo is Full Professor of Computer Science in the Department of Informatics and Telecommunications at the University of Catania. Her research interests include information retrieval, query languages, distributed systems and formal languages. She received a degree with Honours in Electrical Engineering from the University of Catania, Italy, in 1983. She is a member of the ACM.

Vincenzo Nicosia received his degree in Computing Engineering (2004) and is currently a Ph.D. student at the Dept. of Informatics and Telecommunications Engineering. His research interests include distributed systems, security, peer-to-peer networks and complex systems.

Giuseppe Mangioni received the degree in Information Engineering (1995) and the Ph.D. degree (2000) at the University of Catania (Faculty of Engineering), where he became a Professional Engineer in 1995. In 1996 he joined the Dept. of Information and Telecommunications Engineering as a contract researcher. Currently he is a contract professor in Computer Networks at the Faculty of Engineering of Catania. Presently, his main research interests concern e-learning, P2P overlay networks and complex systems.