Information Processing and Management 43 (2007) 624–642 www.elsevier.com/locate/infoproman
Search and browse services for heterogeneous collections with the peer-to-peer network Pepper Henrik Nottelmann, Gudrun Fischer
*
Department of Informatics, Faculty of Engineering Sciences, University of Duisburg-Essen, 47048 Duisburg, Germany Received 23 December 2005; received in revised form 19 June 2006; accepted 8 September 2006 Available online 25 October 2006
Abstract Peer-to-peer networks have recently gained importance as environments for digital libraries. In this paper, we describe Pepper, a peer-to-peer network for supporting two important access modes, searching and browsing. Pepper seamlessly combines hierarchical networks with service-oriented architectures. Hierarchical networks improve efficiency by partitioning peers into high-end hubs which are responsible for routing the query, and leaf peers (typically running on desktop PCs or laptops) which host documents. Service-oriented architectures allow for loosely coupled systems, improve extensibility and reusability, and enable integration of the two access paradigms by combination of multiple services. For example, distributed searching and browsing both use the same statistical metadata, and thus the same storage service, for solving their tasks. Ó 2006 Elsevier Ltd. All rights reserved. Keywords: Hierarchical peer-to-peer networks; JXTA; Service-oriented architectures; Resource selection; Scatter/gather clustering
1. Introduction Literature search is one of the major work tasks for scientists. Two paradigms for supporting this task can be distinguished: When the searcher has some knowledge about her information need (e.g., when she knows the title or authors of a particular paper), she poses a query and investigates the returned ranked list of potentially relevant documents. When a researcher needs to get an overview (e.g., over the state-of-the-art in a particular field), it is difficult or even impossible to specify a useful query. Here, the system has to present the whole collection in such a way that a user can easily browse through the available documents. Documents are stored in digital libraries (DLs for short), e.g. the ACM digital library, or personal collections with technical reports. Typically, multiple DLs have to be combined for satisfying an information need. Manual searching in such a distributed environment, i.e. searching each digital library in isolation, is cumbersome and frustrating. Users typically want a single-stop solution which efficiently supports the search process
*
Corresponding author. E-mail address: fi
[email protected] (G. Fischer).
0306-4573/$ - see front matter Ó 2006 Elsevier Ltd. All rights reserved. doi:10.1016/j.ipm.2006.09.006
H. Nottelmann, G. Fischer / Information Processing and Management 43 (2007) 624–642
625
in the underlying libraries and solves the problems arising from heterogeneous metadata schemas (e.g., BibTeX, Dublin Core, MARC). As one such solution, this paper proposes a peer-to-peer network of distributed heterogeneous, reusable services, which is currently implemented in the Pepper prototype.1 In our system, each user can autonomously maintain her own collection of documents which forms the basis of retrieval and browsing. This structure makes it particularly easy to provide, e.g., fulltexts (as PDFs) of personal publications or metadata collections (e.g., in BibTeX style) to the public. Other peers can wrap web portals like ACM. Special services deal with the heterogeneity of collections by transforming queries or documents. Basically, a service provides a modular piece of functionality which can be reused in different applications or for different scenarios: For example, the task of storing and providing statistics about document collections, or the task of transforming documents into different schemas are required for both retrieval and browsing. The Pepper system integrates more than a dozen services which facilitate searching, browsing, or both. For efficiency, the design is based on a hierarchical network, where the peers are divided into two distinct classes, i.e. hubs (directory server peers) and leaves (typically running on a user machine). JXTA-SOAP is employed for communication, although other middlewares can be plugged in as well. The following section describes the proposed architecture and the set of services. Section 3 introduces the concept and generation of statistical metadata of collections, called ‘‘resource descriptions’’. The problems of efficient query routing and of providing browsing functionality are discussed in Sections 4 and 5. In Section 6, we tackle the issue of integrating data from heterogeneous schemas. Section 7 reflects related work in the fields of architectures and distributed retrieval and browsing. 2. A service-oriented peer-to-peer network The architecture of the Pepper prototype is built on the ideas of service-oriented architectures and peer-topeer networks. Fig. 1 depicts the layers which are involved. They are described throughout this section. 2.1. Network topology Pepper borrows an important idea from other peer-to-peer networks like KaZaA or JXTA Search (Waterhouse, 2001): The set of peers is partitioned into two disjoint classes, hubs (sometimes also called super peers or ultra peers) and leaves (also called DL peers): Hubs are static dedicated peers with large memory and high-computation power, which remain online for most of the time. Hubs work as directory peers, which means that other hubs or leaves register (some of) their services at a hub. As a consequence, hubs are, e.g., responsible for query routing. Leaves host user-oriented services. For example, if a user wants to share the documents with others in the network, a search service is running on his PC which allows for retrieval on that local collection (on that ‘‘digital library’’). Leaves are only connected to hubs; there is no direct connection between leaf peers. The hub-to-hub connections can form an arbitrary peer-to-peer network, or the hubs can be arranged in a more specific topology. In Section 4, we compare three different topologies and corresponding resource selection approaches: an arbitrary hub network, a hub HyperCube (Schlosser, Sintek, Decker, S, & Nejdl, 2005), and a DHT-based global index distributed over the hubs (which imposes a ring structure on the hub network (Stoica, Morris, Karger, Kaashoek, & Balakrishnan, 2001)). Other topologies for the hub network are conceivable as well. This family of topologies (with a hub network and leaf peers attached to the hubs) is called a hierarchical network. The advantage of such a topology is the smaller number of peers that are involved in the query routing process, and the lower number of messages required. The impact of churn (arbitrary peers dynamically joining and leaving the network) is reduced, if the hubs, at least, are designed to remain online and stable most 1
http://www.pepper-project.org.
626
H. Nottelmann, G. Fischer / Information Processing and Management 43 (2007) 624–642
Service
Pepper
Service
Service
JXTA–WS JXTA–SOAP JXTA Fig. 1. Layers: Pepper, JXTA-WS, JXTA-SOAP and JXTA. Table 1 Services in Pepper Service name
Service description
Leaves Leaf inventory (LIS) Search Browsing Query transformation Document transformation Description transformation
Manages documents and resource descriptions Provides retrieval on a local collection Provides browsing functionality over all collections Transforms a query between fixed schemas Transforms documents between fixed schemas Transforms resource descriptions between fixed schemas
Hubs Registration Hub inventory (HIS) DTF cost estimation Optimum selection Resource selection (RS) DHT HyperCube Result merging (RM) (Hub) search
Registers and unregisters services Manages resource descriptions Computes cost estimations for neighbour search services Computes optimum selection for given cost estimations Selects most appropriate neighbour search services Maintains the distributed hash table (if used) Maintains the HyperCube topology (if used) Merges results from several search services Calls other services for processing queries
of the time. Together, hierarchical networks scale better than unstructured networks, as the scalability problem is shifted to the smaller hub sub-network. 2.2. Services in Pepper Pepper employs a flexible architecture where services (written in Java) do not depend on the underlying middleware. Any class can be used as a service; another service can invoke all methods which are defined in one or more interfaces. If a service wants to interact with its environment (e.g., to find other services), it has to implement two methods which allow to set and retrieve a ‘‘service environment’’ object. The services found are returned as simple objects; if a remote service method has to be invoked, this is covered by an ordinary local Java method call. A single peer can provide multiple services, which can be retrieved by its name. The Pepper system hosts a heterogeneous set of 15 services (see Table 1). Different implementations of a service interface are possible, e.g. for supporting different routing and retrieval strategies. The functionality of the services is described in the following sections. 2.3. The JXTA-WS framework As the Pepper services do not depend on any particular middleware, the same implementation can be used in different contexts. By default, Pepper uses the JXTA-WS (JXTA web services) framework, a combination of web services and peer-to-peer networks we have created. It is built upon JXTA (Gong, 2001) and JXTASOAP.2 JXTA is one of the state-of-the-art peer-to-peer frameworks and defines a set of open, XML-based 2
http://soap.jxta.org/.
H. Nottelmann, G. Fischer / Information Processing and Management 43 (2007) 624–642
627
protocols for connecting a variety of devices, e.g. PCs, PDAs or mobile phones. The JXTA reference implementation is written in Java.3 The sub-project JXTA-SOAP is used for communication in JXTA-WS, as it allows to send SOAP messages4 through a JXTA network. JXTA-WS dynamically creates ‘‘stubs’’ for remote services. A stub is a class which implements all service interfaces, and redirects calls to the remote peer via JXTA-SOAP. Thus, the SOAP communication is completely hidden to the service developer. JXTA-WS automatically instantiates services on peers and provides convenient configuration support for single services (specifying properties for service classes and their instances) and whole peers (defining the services which are instantiated on each peer). It also adds capabilities for finding services (described by JXTA advertisements) running on the local or on remote peers. Although Pepper seamlessly interoperates with JXTA-WS, the latter is general enough that it can be used in other projects without changes. On the other hand, Pepper could also use alternative middlewares. A simulator, for instance, could provide a functionality similar to JXTA-WS by running several peers in the same program, with direct Java calls to remote services. This could be useful for both network simulations and testing, as a simulation is much more efficient and faster to set up than real networks. 3. Retrieval model and resource descriptions This section introduces the theoretical basis of the search and browse tasks (see Sections 4 and 5): We first describe our probabilistic retrieval model, and then present an efficient way of representing statistical metadata about the collections involved, called ‘‘resource description’’, which is required for distributed retrieval and browsing. In particular for the browsing task, we consider resource descriptions of different granularity, organised in a resource description hierarchy. 3.1. Probabilistic retrieval model Pepper employs a simple, linear probabilistic retrieval model, which boils down to the scalar product in the vector space model. Let ~ dðtÞ ¼ PrðtjdÞ 2 ½0; 1 denote the indexing weight of term t inPdocument d (describing its importance), and ~ qðtÞ ¼ PrðqjtÞ 2 ½0; 1 be the weight of the term in query q with t2q~ qðtÞ ¼ 1. Thus, documents and queries are considered as vectors with a probabilistic interpretation of the weights. Assuming disjointness of query terms (Wong & Yao, 1995), retrieval status values (RSVs) are computed as the scalar product (a weighted linear sum) of the document and the query vector. These RSVs are then mapped onto the probability Pr(reljq, d) that d is relevant to q with a suitable constant c: X ~ RSVðd; qÞ ¼ PrðqjtÞ ¼ qðtÞ ~ dðtÞ; ð1Þ t2q
Prðreljq; dÞ :¼ c RSVðd; qÞ:
ð2Þ
The model can be theoretically justified following Rijsbergen’s paradigm of Uncertain Inference (van Rijsbergen, 1986), by using the interpretations: ~ dðtÞ ¼ Prðt
dÞ;
~ qðtÞ ¼ Prðq
tÞ;
RSVðd; qÞ ¼ Prðq
dÞ;
c ¼ Prðreljq
dÞ
Different indexing schemes can be used, e.g. normalised term frequencies or BM25 (Robertson, Walker, Hancock-Beaulieu, Gull, & Lau, 1992). It has been shown in Nottelmann and Fuhr (2006) that local IDF values, which are computed for each DL autonomously, are outperformed by hub-global ones. The latter are computed on each hub H for each term t from the term’s document frequency in the combination of the DLs which are directly connected to H (i.e., the DLs in the directly neighbourhood of hub H, see Section 3.3). Pepper is able to deal with simple schemas, in which documents are modelled as a linear list of attributes (like ‘‘author’’, ‘‘editor’’, ‘‘title’’ or ‘‘year’’). A popular example of a schema with such a flat, linear structure is Dublin Core (Dublin Core Metadata Initiative, 1999). Such documents can be easily expressed as XML 3 4
Implementations in C and other programming languages are available, but development is far behind the Java version. http://www.w3.org/TR/SOAP/.
628
H. Nottelmann, G. Fischer / Information Processing and Management 43 (2007) 624–642
documents (with only two levels of elements), or as binary relations (as used in Section 6 for declarative definitions of schema mappings). As a consequence, the single document vector ~ dðtÞ has to be replaced by one vector which combines document vectors for each attribute. For simplicity, we ignore this detail in the following descriptions of retrieval and browsing; both access modes can be extended towards linear schemas in a straight-forward way. 3.2. A hierarchical organisation of resource descriptions A ‘‘resource description’’ contains statistical metadata about a collection, which can be exploited for resource selection and browsing. CORI (Callan, Lu, & Croft, 1995), e.g., stores terms and their document frequencies, while resource descriptions have the form of language models in the resource selection approach with the same name. Resource descriptions in Pepper are different: They are considered as probabilistic vectors ~ AðtÞ ¼ PrðtjAÞ 2 ½0; 1 which are the centroids of the descriptions of a set of documents A = {d1, . . . , da}. In other words, the values in the description vector are defined as the average term weights of the documents in the set A: 1 X~ ~ AðtÞ ¼ Prðtjd 2 AÞ ¼ Eð~ dðtÞjd 2 AÞ ¼ dðtÞ: ð3Þ jAj d2A Here, jAj denotes the cardinality of A, i.e. the number of documents in the set. If a term t does not appear in any document in A, then ~ AðtÞ ¼ 0. Resource descriptions are defined at different granularities (‘‘level of abstraction’’): In the easiest case, the set A = {d} contains exactly one document. Then, such a description vector on the document level is equivalent to the document vector (defining the indexing weights of the terms): ~ A ¼~ d. In other words, the document index (i.e., the document vectors ~ dðtÞ) can be included seamlessly into resource descriptions. ~ con The other extreme is a single description vector for the whole collection DL. In this case, the vector DL tains all terms in the collection and the average of the indexing weights in this single collection. In a more general setting, the resource description contains 1 6 n 6 jDLj description vectors, each representing a cluster. If such a cluster C is split into sub-clusters C1, . . . , Cc for which description vectors ~ Ci are already computed, then ~ C can be computed efficiently by directly combining (‘‘aggregating’’) these description vectors: X X 1 ~ Prðd 2 C j jd 2 CÞ Eð~ dðtÞjC j 2 DLÞ ¼ P jC j j ~ C j ðtÞ: ð4Þ CðtÞ ¼ j jC j j j j These different kinds of resource descriptions form a hierarchy (see Fig. 2), with the document vectors as leaves, the collection vector as root, and description vectors computed by aggregating description vectors on lower levels of the hierarchy. By relying on this hierarchical organisation and on a probabilistic retrieval model, we are able to model resource descriptions at different granularities smoothly. The hierarchy can be computed by any hierarchical clustering algorithm, or by repeated application of a partitioning algorithm, provided that the resulting clusters do not overlap. Currently, we employ Buckshot (Cutting, Pedersen, Karger, & Tukey, 1992), a variant of k-means, with the scalar product of the document term vectors as similarity measure. This corresponds to the RSV (Eq. (1)) in our retrieval model, if the terms of one document are treated as a query which is compared to the other document. In addition to the term statistics, we compute a homogeneity score h(Ci) for each cluster Ci, measuring the mean similarity of the individual elements (resource descriptions in the cluster) d to the cluster centroid ci: 1 X simðd; ci Þ: ð5Þ hðC i Þ ¼ jC i j d2Ci
H. Nottelmann, G. Fischer / Information Processing and Management 43 (2007) 624–642
Hub
...
Leaf
629
Hub
...
Leaf
(a) Homogeneous content on leaves
Leaf
Leaf
(b) Heterogeneous content on leaves
Fig. 2. Resource descriptions of different granularity, stored in leaf inventory services and hub inventory services.
This homogeneity score is important for the browsing service (see Section 5). In the more general context of resource descriptions, it is an indicator for the level of abstraction of a given resource description. As described on the document level, Pepper is able to deal with simple schemas. As a consequence, documents and resource descriptions may consist of attributes and separate values for each attribute. For computing the cluster hierarchies, these separate values can be combined in a straight-forward way into one term vector. Thus, the scalar similarity measure can be used for simple schemas as well, and the resource description service will still create one single hierarchy. 3.3. Resource descriptions and cluster hierarchies for hubs The neighbourhood of a hub H is the set of DL leaves which can be visited through H. In a simple yet effective and efficient scenario, this neighbourhood of H contains exactly all leaf peers (with their DLs) which are directly connected to hub H. The description of hub H can be formed by aggregating the resource descriptions of the DLj following Eq. (4). In order to extend the concept of the resource description hierarchy to hubs, we could compute a cluster hierarchy from all the documents in a hub’s neighbourhood. However, when a leaf left or joined this neighbourhood, we might have to touch and adapt each cluster in the whole hierarchy. Therefore, we chose a compromise solution: Instead of single documents from each leaf, we use the leaves’ resource descriptions to compute the hierarchy on the hub. In this case, when a leaf joins or leaves the network, only one path of the hub’s cluster hierarchy will be affected. An abstract representation of such a cluster hierarchy is depicted in Fig. 2a. The squares and hexagons in the figure symbolise semantically similar content. The content on each leaf (DL) is rather homogeneous, so that a leaf’s resource descriptions represents its content quite accurately. In this case, it suffices to use the uppermost layer of the leaves’ cluster hierarchies (i.e., the roots) as a basis for the hub’s cluster hierarchy. On the other hand, if the individual DLs contain rather heterogeneous content themselves, then a leaf description will already be a very generalised representation of the leaf’s content and may not differ much from a neighbouring leaf description. Such a situation is depicted in Fig. 2b. The resource descriptions of the two leaves are similar, while both contain documents about two different topics (symbolised by the squares and hexagons). If those topics are to be present in the hub hierarchy as well, then the hub hierarchy must be computed from a deeper layer of the leaf hierachies. Regardless of the chosen depth, however, the content of the roots on hubs and leaves themselves does not change. In both cases, a leaf description is the aggregation of all document descriptions in its DL, and a hub description is ultimately the aggregation of all document descriptions in the hub’s neighbourhood. Hub descriptions are used for decentralised resource selection (Section 4), where a hub H may store the hub descriptions of its adjacent hubs Hi, in addition to its own hub description. Moreover, the proposed browsing
630
H. Nottelmann, G. Fischer / Information Processing and Management 43 (2007) 624–642
approach requires a global cluster hierarchy, which – in its first implementation – has to be computed from all hub descriptions in the network (see Section 5). 3.4. Resource description services Two closely related services are responsible for creating, storing and handling resource descriptions in leaves and in hubs: Leaf inventory services (LIS) store the documents in a single collection, the resource description with its clusters and description vectors (including the document index, i.e. the clusters with a single document), and provide full documents and resource descriptions upon request. As the name suggests, a LIS runs on a leaf. Hub inventory services (HIS) store resource descriptions of multiple collections (from leaves and hubs) and provide them upon request. HIS are running on hubs. These resource descriptions can be used by a variety of other services. For example, the search service uses the single-document description vectors for providing retrieval capabilities on the collection. The resource selection service (see next section) uses the resource description with a single vector for the whole collection, while browsing (see Section 5) is based on different levels of the cluster hierarchy. So, the same service implementation and the same techniques of creating resource descriptions can be employed for different scenarios. Peer-to-peer networks are characterised by a dynamic behaviour, where peers appear and disappear at any time. In such an environment, transferring resource descriptions becomes a crucial efficiency issue. Several countermeasures are used in Pepper for improved efficiency: Each resource description is quite small, 2 MB on average for leaf peers (for about 26,500 documents). These descriptions can further be pruned (Lu & Callan, 2002), i.e. only the terms with the highest weights are included. For browsing, the same principle was already employed in Cutting et al. (1992). There, the top 50–100 terms sufficed for browsing purposes. Each leaf aims to connect to the same hub it has been connected before, as no new transfer has to be initiated in that case. When a single new document is inserted into a collection, the resource descriptions are kept untouched. Updates are performed only after a certain number of insertions, or after a predefined time limit. When a peer joins a hub neighbourhood, its description is inserted into the hub’s cluster hierarchy following the path of the cluster descriptions which are most similar to the peer description, from the root downward. Thus, of the cluster descriptions used for browsing (see Section 5), only one of the hub’s several top clusters is changed and its description has to be resent for the next request. The hierarchy on the hub is recomputed after a predefined time limit. Peers leaving the network are handled accordingly. When a hub fails, its peers will get a time-out as soon as they try to contact the peer-to-peer network. Then, they will attach themselves to other, available hubs. If a hub leaves the network in a controlled way, then it can notify its leaves, so that they will not need to wait for the time-out. 4. Resource selection It is too expensive to send the query to all search services. Resource selection is the task to decide to which peers a query should be broadcasted, and how many documents have to be retrieved from each selected peer. This decision is based on the resource description introduced in the previous section. In the remainder, we assume that a user specifies the number n of documents she wants to receive; this number is sent in combination with the query q through the network. The detailed task then is to compute, for every DLi, the number si of documents which have to be retrieved from DLi, so that the total number equals a user-specified value n.
H. Nottelmann, G. Fischer / Information Processing and Management 43 (2007) 624–642
631
4.1. Decision-theoretic framework Traditional resource ranking approaches first find the best collections, and then select a constant number of documents from each selected collection (i.e., si 2 {0, const}). In contrast, the decision-theoretic framework (DTF) (Nottelmann & Fuhr, 2003; Fuhr, 1999) returns those si which optimise the overall selection. Costs are introduced in DTF as a general-purpose optimisation criterion: Similar to the probability ranking principle (PRP) (Robertson, 1977), costs are assigned to relevant and non-relevant documents. In other words. retrieval quality is measured as the number ri(si, q) of relevant documents in the result set of si documents. Computation time in the DL (which is often assumed to be constant), and communication time for sending the documents (proportional to the number of documents si) leads to an affine linear function. Monetary costs are mainly per-document charges. Actual costs are unknown at query time, thus expected costs ECi(si, q) for retrieving si documents have to be considered. They are computed as the weighted sum of the expected costs for the different cost criteria; a user can specify a selection strategy (e.g., ‘‘fast results are preferred over high-quality ones’’) by weighting factors: ECi ðsi ; qÞ :¼ crelevance ECi ðsi ; qÞ
relevance
þ ctime ECi ðsi ; qÞ
time
þ cmoney ECi ðsi ; qÞ
money
:
The goal is then to compute a selection vector ~ s ¼ ðs1 ; . . . ; sm Þ which minimises the overall costs: m X ~ ECi ðsi ; qÞ: s :¼ P argmin m i¼1
ð6Þ
ð7Þ
si ¼n i¼1
Retrieval quality, i.e. the expected number E[ri(si, q)] of relevant documents in the result set of si documents, is estimated in two steps: Based on the linear retrieval model introduced in Section 3.1 with query term weights ~ qðtÞ, resource descrip~ and the constant c, the expected number E(reljq, DL) of relevant documents in the whole tion vectors DL collection is estimated as X X ~ ~ Eðreljq; DLÞ ¼ Prðreljq; dÞ ¼ jDLj c qðtÞ DLðtÞ: ð8Þ d2DL
t2q
In a second step, an approximation for the recall-precision function of the retrieval system associated with DL is employed for estimating retrieval quality in the top ranks, i.e. for transforming E(reljq, DL) onto E[ri(si, q)]. Please note that in this context the linear retrieval model uses no metadata apart from the resource descriptions, which consist (as described in Section 3) of one term-weight vector per resource (document or DL) for unstructured text data. In the case of a simple, linear schema, the resource descriptions are composed of one term-weight vector per resource and attribute. The search services return probabilities of relevance Pr(reljq, d), which are directly used for merging results together in one list, without further normalisation. 4.2. Different strategies and topologies for resource selection Based on the principle of hierarchical networks, different topologies can be considered (see Nottelmann & Fuhr, 2006 for a detailed overview): In the default case, no restrictions on the hub sub-network are imposed. Similar to Lu and Callan (2004), resource selection is then applied in a decentralised way: Each hub receiving the query selects those neighbour leaves and neighbour hubs which minimise overall costs, employing resource descriptions of all neighbours (including hub descriptions). Fig. 3a depicts the message flow in the whole network; here, thick lines denote connections over which messages are sent (the numbers indicate the order), while thin dotted lines are other connections. In this example, hub 1 is the ‘‘initial hub’’, the first hub which receives the query. DTF at hub 1 selects one of its neighbour leaves and hub 2 (and forwards the query to these peers), but not hub 3. Hub 2, however, computes a local optimum selection which now includes hub 3. DTFbased selection is deterministic with respect to the initial hub, i.e. barring churn, it yields the same set of
632
H. Nottelmann, G. Fischer / Information Processing and Management 43 (2007) 624–642 Pointers
1
15
1
1
3
7
3
2
1
3
4
1
1
4 5
2
1
5
4
12
0
6
1 2
0
2
13
2 0
2
2
8
2
1
2
Peer
0
2
3
11 10
Term
6
3
(a) Hierarchical net-works
(b) Partitions induced by Hyper-Cubes
(c) DHT
Fig. 3. Resource selection in hierarchical peer-to-peer networks.
documents. For different initial hubs, however, the optimal set of documents may differ, because the costs for obtaining the documents (e.g., communication costs) generally depend on the requesting hub. Cycles in the hub sub-network impose several problems. In particular, efficiency decreases as the same hub can be contacted several times. As one solution, an alternative topology based on HyperCubes (Schlosser et al., 2005) can be employed. A HyperCube is a regular d-dimensional structure, where each peer is connected to exactly d other peers (one per dimension, see Fig. 3b). Messages arriving via a connection on dimension l 2 {0, 1, . . . , d 1} can only be forwarded to peers on strictly higher dimensions l 0 > l. For example, if a search starts on hub 1, this hub 1 can forward the query to hubs 2, 3 and 5. Hub 3 can only forward the query to its DL neighbours and hub 7 when it is contacted by hub 1; the dimensionality of the link to hub 4 is too small. However, if the search starts from hub, and it contacts hub 3 on dimension 0, then hub 3 can forward the query both to hub 1 on dimension 1 and to hub 7 on dimension 2, because the dimensions of those connections are greater than the dimension on which hub 3 was reached. Thus, the dimensions define (starting from an arbitrary peer) a spanning tree on the network, which ensures that there is exactly one path from one peer to another peer. It also corresponds to a clearly defined partition of the whole network, which is exploited for creating resource descriptions. Resource selection is performed as before, where each hub computes an optimum selection of its neighbour peers (which can be reached in the HyperCube topology). A third option is to employ distributed hash tables (DHTs). Similar to traditional hash tables, they provide fast insert and lookup functionality for keys based on hash functions. The DHT system Chord (Stoica et al., 2001) (see Fig. 3c) maps peers and keys onto numbers using the same hash function. Peers are ordered in a ring according to the hash values, and each peer is responsible for storing the values for all keys mapped onto its hash value (and all values higher than the preceding peer). In Fig. 3c, one term which is mapped onto the hash value 4 is stored on the peer whose key has the same hash value. In contrast, another term is mapped onto hash value 2, which does not correspond to any peer. Thus, peer 4 is responsible for storing that term, too. Each peer maintains a ‘‘finger list’’ of peers in exponentially increasing distance, which contains all known peers and allows for efficient routing in Oðlog mÞ hops for m peers (ignoring the hub-to-hub connections in the two other topologies): If peer 1 wants to retrieve the value for the term mapped onto hash value 12, then it use its finger list to contact peer 11, which is the closest known peer preceding peer 13 (which is responsible for that key). For Pepper, a natural choice is to use hubs only for the DHT, to use terms t as keys and the corresponding description vector weights ~ CðtÞ as values (i.e., inverted lists). Once the inverted lists are fetched, a centralised selection takes place. This contrasts other DHT approaches, where either the whole document index is kept in the DHT (Harren et al., 2002), or DHTs are used for resource selection in unstructured networks (Minerva, Bender, Michel, Zimmer, & Weikum, 2004). A variant only stores hub descriptions in the DHT (yielding much smaller inverted lists), so that an additional selection step is required at selected hubs. The advantage of DHTs is that cost estimations can be collected faster, as only a small number of lookups is required. 4.3. Resource selection services Each leaf hosts arbitrarily many leaf inventory services and attached search services. Hubs host several service instances which are involved in query processing:
H. Nottelmann, G. Fischer / Information Processing and Management 43 (2007) 624–642
633
A DTF cost estimation service returns for a given query a cost estimation of all known neighbour search services. An optimum selection service receives cost estimations, and returns an optimum selection. A resource selection service in general returns a selection. Different variants for the three topologies are employed, based on the DTF cost estimation and the optimum selection service. However, a different (e.g., CORI-based) selection could easily be implemented by creating a new service instance which is independent from these two services. A result merging service receives the results from different search services and returns a single ranked list. Currently, the documents are re-ranked according to their original weights (probabilities of relevance). A (hub) search service receives the query, just like a search service backed by a DL, and utilises other services on the hub and on neighbour peers for retrieving results. The hub inventory service instance stores one description vector for each collection, together with further information like the number jDLj of documents in the collection or the constant c. When distributed hash tables are employed for a faster centralised selection, a DHT service is responsible for maintaining the distributed hash table. Alternatively, a HyperCube service might be used for creating the HyperCube topology (if required during runtime). In environments with heterogeneous schemas, schema mapping services have to be employed (see Section 6 for a detailed description).
4.4. Experimental results The decision-theoretic framework has been proven to be effective, and to yield competitive results compared to CORI (Nottelmann & Fuhr, 2003). This section reports the quality of DTF applied to peer-to-peer networks (an extended evaluation is carried out in Nottelmann & Fuhr (2006)). Experiments have been conducted on a large test-bed: 2500 collections (each on a leaf) and 25 hubs, which have been created from part of the WT10g collection (Lu & Callan, 2003). The WT10g collection was divided into 11,485 subcollections according to the document URLs, and the 2500 leaf collections were chosen randomly from this set of subcollections. Similar collections were assigned to the same hub (per hub, at least 13 and at most 1013 leaves), while the connections between the hubs were generated randomly, with each hub having between 1 and 7 neighbour hubs. The overall test-bed is rather heterogeneous. One thousand short queries are applied onto this network. As no relevance judgements are given, pseudo-relevance data is used, created from combining all 2500 collections into one centralised collection, ranking the documents w.r.t. their probabilities of relevance for a given query, and marking the 50 top-ranked documents as ‘‘relevant’’. Thus, the experiments measure how well distributed retrieval with the different topologies and resource selection strategies approximates a centralised collection. The decentralised selection used in Pepper has been compared to a centralised selection (both, however, on the distributed collection). In the latter, the query is flooded through the hub-sub-networks, and each contacted hub returns cost estimations of all neighbour leaves. Finally, a single selection is performed based on all collected cost estimations. Thus, only the cost estimation is distributed, computing a selection and contacting the selected DLs is centralised. Results are depicted in Table 2. Not surprisingly, centralised selection outperforms the decentralised variant in terms of effectiveness: Precision in the top ranks, average precision and set-based recall and precision are 6–13% worse for decentralised selection. However, an important advantage of decentralised selection lies in its efficiency, as it does not contain an expensive flooding stage (where nearly 72 hops are required for this cost collection phase) Thus, decentralised selection requires 42% less hops. HyperCubes and DHT perform worse in the top ranks, but outperform the centralised hierarchical network variant in lower ranks, in terms of mean average precision (MAP) and set-based recall and precision. A comparison of our results with other methods is difficult, as these experiments are not comparable. However, the decentralised DTF selection approach seems to be competitive to the results reported in Lu and Callan (2004) and Lu and Callan (2003). Future work should concentrate on a direct comparison of the different approaches.
634
H. Nottelmann, G. Fischer / Information Processing and Management 43 (2007) 624–642
Table 2 Resource selection results Centralised
Decentralised
HyperCubes
DHTs
P@5 P@10 P@15 P@20 P@30 MAP
0.8032/+0.0% 0.6630/+0.0% 0.5562/+0.0% 0.4793/+0.0% 0.3792/+0.0% 0.2586/+0.0%
0.7058/12.1% 0.5721/13.7% 0.4896/12.0% 0.4295/10.4% 0.3456/8.9% 0.2391/7.5%
0.7528/6.3% 0.6411/3.3% 0.5622/+1.1% 0.5079/+6.0% 0.4309/+13.6% 0.3554/+37.4%
0.7546/6.1% 0.6468/2.4% 0.5723/+2.9% 0.5186/+8.2% 0.4405/+16.2% 0.3670/+41.9%
Precision Recall
0.2706/+0.0% 0.2582/+0.0%
0.2515/7.1% 0.2417/6.4%
0.3254/+20.3% 0.3256/+26.1%
0.3326/+22.9% 0.3328/+28.9%
#Hops
103.3/+0.0%
59.9/42.0%
57.1/44.7%
60.6/41.4%
5. Browsing The browsing service in Pepper enables Scatter/Gather browsing (Cutting et al., 1992) on the whole distributed document collection. Scatter/Gather browsing is a highly interactive, cluster-based browsing paradigm in which an overview of the whole dataset or of a selected subset is generated by clustering and presented to the user. The user can then again select parts of the overview (or the whole) for further inspection, and so on. Fig. 4a shows the general principle. A given data set, the focus set, is clustered by the system. This is the so-called scatter step, depicted by the dashed lines in the figure. The user selects one or more of the presented clusters (solid lines), which are then merged (gather step) and form the new focus set, etc. Eventually, the user reaches a sufficiently small focus set which mainly contains relevant items. Cluster-based browsing is an alternative to search, in cases where the user has little knowledge of the document collection. In that situation, she may not be able to formulate a useful query, and will need to gain an overview of the data first. Providing overviews is therefore the main objective of Scatter/Gather browsing in Pepper. In contrast to other top–down browsing approaches, e.g. a browsing through a classification hierarchy, the user is not restricted to a pre-defined structure, but decides for herself, in each gather step, which partitioning of the data is most meaningful to her. This is particularly useful in heterogeneous document collections where there might be several ways to cluster documents together to build an overview. The user herself can steer the process easily in each interaction step, simply by selecting the most promising parts of the data.
Global 3rd (global) tier ...
...
Hub
Hub 2nd tier (on hubs) ...
...
...
...
...
Leaf
Leaf
First tier (on leaves) ...
(a) Scatter/Gather browsing
...
...
...
...
...
...
...
(b) Three-tier hierarchy
Fig. 4. Scatter/Gather browsing in Pepper.
...
H. Nottelmann, G. Fischer / Information Processing and Management 43 (2007) 624–642
635
5.1. A three-tier hierarchy In pre-processed Scatter/Gather (Cutting, Karger, & Pedersen, 1993), it was shown that the clusterings do not have to be computed directly from the underlying documents. It suffices to consider representatives of datasets instead, which contain the most important statistical properties. Cutting et al. (1993) used cluster summaries from a pre-computed cluster hierarchy. In Pepper, we already have a hierarchy of resource descriptions on each leaf peer and on the hubs, which can serve the same purpose. In contrast to Cutting et al., however, we do not generate one global cluster hierarchy for the document collection, which would be impossible in the highly dynamic environment of a peer-to-peer network. Instead, as described in Section 3, the resource description hierarchy is calculated separately for each leaf and represents the content of the leaf only. Similarly, the hierarchy on a hub represents the content in its neighbourhood only, building on the description hierarchies on its leaves. Therefore, the resource description hierarchy for a hub neighbourhood already contains two tiers: the cluster hierarchies on the leaves, and the cluster hierarchy on the hub (which is formed from the topmost clusters on the leaves, as described in Section 3.3). For an overview of the whole content available at a certain point of time, we introduce a third, global tier. For the current, preliminary implementation of the browsing service, we assume that each hub knows all other hubs in the network.5 Each hub collects the descriptions of all other hubs, and combines them into a global hierarchy. Fig. 4b shows the resulting three-tiered hierarchy of resource descriptions, where each tier consists of the cluster hierarchies on that level (leaf, hub, or global). In smaller networks, the global hierarchy may be computed just in time for a browsing request, while in larger networks, it should be precomputed and updated or recalculated in intervals. Due to peer dynamics (peers joining and leaving the network) and content dynamics (documents being added, deleted, or changed), the resource description hierarchies on hubs and leaves have to be updated continuously. In Section 3.4, we listed some countermeasures to reduce the effort of these updates and keep changes as local as possible. However, the effectiveness of these measures, and their limitations, will have to be investigated in more detail in dedicated experiments.
5.2. Scatter/Gather browsing in Pepper When the user starts browsing, the browsing service (running on a peer) requests the initial overview over the whole collection from its hub. The hub inventory service sends the uppermost layer of its global resource description hierarchy, which is presented to the user as a set of clusters. The user selects one or more clusters, the focus set, for further inspection. Using the homogeneity scores h(Ci) (Eq. (5)), this focus set is expanded by replacing the most heterogeneous resource descriptions with their children, i.e. with a higher number of resource descriptions for the same set of data, representing the data in more detail. The focus set is expanded further by the same method, until its size reaches the maximum number of elements which can be clustered fast enough for online browsing. This expansion step is very similar to the expansion step in Cutting et al. (1993), except that we expand the most heterogeneous clusters rather than the largest ones. Fig. 5 shows the expansion step in a simplified example: Let a given focus set consist of only one cluster with resource description A (homogeneity 0.08). This cluster may have been created by online clustering in the previous scatter step, or it may already stem from a pre-computed resource description hierarchy from a hub or leaf peer. Let the maximum number of items in the focus set, before it has to be clustered for representation, be 5. As the focus set in this example has not reached its maximum size yet, the resource description with the lowest homogeneity score, i.e. A, is selected for expansion. A is removed from the focus set and replaced by its children B and C, which are requested from the appropriate inventory services. Now, the focus set has a size of two, which is still below the maximum. Therefore, the resource description with the lowest homogeneity score, B, is selected and replaced by its children D and E, and the process continues until the 5
Other approaches make the same assumption (e.g., Klampanos & Jose, 2004) but of course, better solutions will have to be investigated.
636
H. Nottelmann, G. Fischer / Information Processing and Management 43 (2007) 624–642 1
2
A 0.8
A 0.08 4
B 0.1
C 0.5 3
D 0.7
E 0.2
H 0.6
F 0.9 I 0.8
(a) Expansion sequence
G 0.6
B 0.1
C 0.5
D 0.7
E 0.2
C 0.5
D 0.7
H 0.6
I 0.8
D 0.7
H 0.6
I 0.8
D 0.7 0.5 I 0.8 C 0.5 F 0.9
F 0.9
H 0.6 G 0.6
(b) Expanding focus set
0. 4
G 0.6
(c) Clustered focus set
Fig. 5. Example of focus set expansion by homogeneity.
focus set contains five items. The numbers at the resource description symbols in Fig. 5a indicate the whole expansion sequence, and Fig. 5b shows the changing focus set. Finally, the focus set contains the resource descriptions D, H, I, F, and G. Note that these resource descriptions represent the same set of documents as A, but in more detail. As soon as the focus set has reached its maximum size, the resource descriptions in the focus set are clustered online and presented to the user for further browsing. In the example, the focus set is partitioned into two clusters (Fig. 5c), and the user would therefore see two resource descriptions from which to select the subset for the next browsing step. 5.3. Discussion Presenting an overview of the whole distributed content to the user is an inherently more difficult task than an explicit search. Therefore, for the browsing service, several problems remain to be solved. The most basic question to our proposed approach for browsing is: Can a tiered hierarchy of resource descriptions (see Section 5.1) work as well as the global cluster hierarchy in Cutting et al. (1993)? In first experiments (Fischer & Nurzenski, 2005), we compared a global cluster hierarchy (configuration ‘global’) to two tiered hierarchies, one where the leaf peers were assigned to hubs randomly (configuration ‘random’), and one where leaves with similar content were clustered together and assigned to the same hub (configuration ‘cluster’). Both tiered hierarchies performed nearly as well as the global hierarchy, and we deem these first results encouraging enough to continue the approach of tiered hierarchies and investigate its problems and possible solutions in more detail. In particular, future experiments should clarify why there was hardly a difference between the ‘random’ and ‘cluster’ configurations (i.e. between the two tiered cluster hierarchies, which differed in the setup of the network only). A possible explanation might be that the leaf clustering was not effective, so that the sets of leaves which were gained by clustering were not significantly different from the sets of leaves which were chosen randomly. This means that the cluster hypothesis should be checked with respect to the leaf clusters. Still, as both tiered hierarchies were reasonably close in performance to the centralised, global setup, this unresolved question does not refute our optimistic judgement of the approach of tiered hierarchies as a whole. Instead, it should be taken into account in future experimentation. Another important issue is communication costs. During browsing, the limitation on the size of the focus set guarantees that only a limited number of peers have to be contacted during the expansion phase. The number of resource descriptions which have to be transferred for this step depends on the desired size of the focus set. Cluster resource descriptions are retrieved only as far as necessary (with the IDs of their direct children instead of the complete underlying hierarchies). Furthermore, hubs can cache resource descriptions. Each hub remembers which resource descriptions have changed (due to leaf peer dynamics), and when. The requesting hub which collects resource descriptions for browsing can send a time stamp with the request, and only
H. Nottelmann, G. Fischer / Information Processing and Management 43 (2007) 624–642
637
those descriptions which have changed since then are sent again. In this context, the impact of churn will have to be considered in more detail as well. Still, in comparison with the resource selection scenario, the proposed solution for browsing has a major drawback: The computation of the initial overview requires contacting all hubs in the network and then clustering their resource descriptions online. Although the resource descriptions can be pruned for this purpose to a very small number of terms, this approach is not scalable beyond a certain number of hubs. To ameliorate this problem, the overview over the hub descriptions could be created regularly and offline, instead of online and for each browsing session individually. This approach would be based on the supposition that the content of a hub region changes slowly even if the individual peers show more dynamics. Even then, however, the hubs would still have to maintain a list of all other hubs. Instead, hub descriptions could be spread through the network by gossiping, for instance, or we could drop the assumption that the entire content in the network must be represented, in favour of obtaining a representative selection of the available content. The latter idea would lead to a kind of resource selection for hub descriptions, in analogy to the resource selection approach for documents in Section 4. Whether that solution is preferable, this question depends primarily on the user’s needs in browsing. Hence, different browsing approaches are currently under research. Finally, apart from heterogeneity on the content level, the documents in a peer-to-peer network may also show structural heterogeneity, i.e. different schemas. In this case, resource descriptions and documents have to be translated to the browsing schema (the schema which the user selected for the browsing session). This is the task of schema mapping services (described in the next section). 6. Schema mapping As mentioned in Section 3, documents with simple schemas are modelled as a linear list of attributes. Examples for such schemas are Dublin Core or BibTeX. Schema mapping services assist other services in heterogeneous environments. When collections employ different schemas, documents and queries have to be converted. This task is accomplished by three kinds of schema mapping services: query transformation services (for converting the user query into the DL schema), document transformation services (for results) and resource description transformation services (for resource selection and browsing). Each schema mapping service is specialised in mapping from one fixed schema onto another fixed schema. Multiple schema mapping services have to be chained to map between two schemas for which no direct mapping is available. Pepper employs the sPLMap framework (Nottelmann & Straccia, 2005), which allows for uncertain mappings. These are required when schemas of different granularity are investigated. For instance, if a schema with the general attribute ‘‘creator’’ has to be mapped onto a schema with two attributes ‘‘author’’ and ‘‘editor’’, there is no precise mapping. The sPLMap framework follows a declarative approach for human-readable schema mappings. The schema attributes are mapped onto relations in probabilistic Datalog (Fuhr, 2000), a probabilistic extension to Horn logics. Queries are logical rules. Schema mappings are defined by rules in a declarative way, e.g.: 0:7 authorðx; yÞ
creatorðd; xÞ;
ð9Þ
0:3 editorðx; yÞ
creatorðd; xÞ:
ð10Þ
These rules state that a creator is an author with probability of 70%, and an editor with probability of 30%. These simple rules can be augmented if the query transformation also has to include search operators (when a specific search operator is not available in the target schema, e.g. soundex similarity for author names), or when the values have to be converted (for instance between different date formats, e.g. from ‘‘2004-31-12’’ to the German ‘‘31. Dezember 2004’’). For document transformation, documents have to be converted into pDatalog facts. Then, the rules are applied, and the resulting facts are converted back into documents (which now might have a probabilistic weight for each attribute). Queries, on the other hand, have to be unfolded using the mapping rules. For example, a query condition referring to ‘‘creator’’ has to be replaced by two conditions for ‘‘author’’ and ‘‘editor’’:
638
H. Nottelmann, G. Fischer / Information Processing and Management 43 (2007) 624–642
qðdÞ
creatorðd; \John Doe"Þ #
0:7 q0 ðdÞ 0
0:3 q ðdÞ
authorðd; \John Doe"Þ editorðd; \John Doe"Þ
7. Related work This section reflects research which is related to this paper w.r.t. to the architecture or its major components (resource selection and browsing). 7.1. Database-oriented query routing in peer-to-peer networks Most research on peer-to-peer networks has focused on database-oriented approaches for sending (‘‘routing’’) a query through the network, which ignores the degree of usefulness of a resource (contrasting traditional IR-based resource selection techniques, see Section 7.2). Early unstructured peer-to-peer networks do not scale: Napster (Saroiu, Gummadi, & Gribble, 2002) with its centralised document index has a single point of failure. Flooding the query through the whole network like in Gnutella (Ritter, 2001) imposes too high a load. The effect of flooding can be reduced in hierarchical networks like KaZaA6 or JXTA Search (Waterhouse, 2001), which partition the peers into hubs (super-peers) and leaves. This architecture is also used, for example, in the more recent project P2P-DIET (Idreos, Koubarakis, & Tryfonopoulos, 2004) which combines ad-hoc search and continuous querying in super-peer networks. Structured networks, on the other hand, have a regular topology which can be exploited for query routing. Distributed hash tables (DHT) like Chord (Stoica et al., 2001) guarantee efficient lookup. A hash function is applied onto keys (terms and peer identifiers), each peer stores values for whose hash values are in the interval between the preceding peer and itself. DHTs can be used to implement semantically global structures on a distributed physical basis. ODISSEA (Suel et al., 2003), for instance, uses a DHT structure to store a global inverted term-document in a peer-topeer network, while GridVine (Aberer, Cudre-Mauroux, Hauswirth, & van Pelt, 2004) implements a semantic overlay for managing and mapping RDF triples and schemas, over a distributed, DHT-based physical layer. Another approach for efficient routings are HyperCubes (Nejdl et al., 2003), symmetric regular networks. HyperCubes assign a dimension with each connection, and allow routing only over dimensions larger than the one over which the message arrived. Thus, they impose spanning trees on the network and thus avoid cycles. All three approaches – hierarchical networks, DHTs and HyperCubes – have been described in detail in Section 4.2. 7.2. IR-based resource selection Resource selection is the task to decide to which digital libraries (or peers) a query should be forwarded, and how many documents have to be retrieved from each selected DL. In contrast to the query routing methods described in the previous section, the decision is not based on the network topology alone, but rather on an estimation of the usefulness and in some cases, the potential costs of contacting a certain peer. For federated digital libraries with their centralised selection component, this problem has been investigated for about a decade. Different to Pepper’s decision-theoretic framework, most of the other selection algorithms rank the digital libraries (DLs) w.r.t. their similarity to the query, and retrieve a constant number of documents from the top-ranked libraries. Popular examples are the INQUERY-based method CORI (Callan et al., 1995) and the language model approach presented by Si, Jin, Callan, and Ogilvie (2002). PlanetP (Cuenca-Acuna, Peery, Martin, & Nguyen, 2002) is a peer-to-peer adaptation of the resource selection con6
http://www.kazaa.com/.
H. Nottelmann, G. Fischer / Information Processing and Management 43 (2007) 624–642
639
cepts from CORI and GlOSS (Gravano, Garcia-Molina, & Tomasic, 1994) and relies on gossiping to spread peer descriptions through the network. It is the basis for the Rumorama protocol (Mu¨ller, Eisenhardt, & Henrich, 2005), which allows peers to store a self-selected number of peer descriptions instead of a complete termto-peer index, but does not take user-defined costs into account. Other recent projects like GALANX (Wang, Galanis, & DeWitt, 2003) and MINERVA (Bender, Michel, Triantafillou, Weikum, & Zimmer, 2005) use DHTs to distribute a global term-to-peer index, which is used for centralised resource (peer) selection. MINERVA balances the usefulness of a distant peer against the communication costs. Pepper generalises this concept in the decision-theoretic framework, allowing cost estimation for (possibly many) different aspects, for which the user may define different degrees of importance. Another problem connected with resource selection is merging the results from selected DLs. In CORI, the library score is used to normalise the document score. The language model approach computes final scores based on the original (collection-biased) document probabilities, the DL scores and a smoothing factor. The quality of this approach is slightly better than CORI. The language model approach has been extended towards hierarchical peer-to-peer networks with decentralised selections on each hub (Lu & Callan, 2003; Lu & Callan, 2004). Hub descriptions are formed by aggregating the descriptions of DLs in a neighbourhood, where the influence of term frequencies of distant DLs is exponentially decreased. All of these IR-based resource selection techniques require resource descriptions, e.g. document frequencies for CORI or a collection language model. Resource descriptions can be provided by the collection itself, or discovered via query-based sampling (Callan & Connell, 2001). 7.3. Cluster-based browsing The browsing method which we propose for our peer-to-peer architecture is an adaptation of pre-processed Scatter/Gather (Cutting et al., 1993). In the previous approach, one global cluster hierarchy is computed offline, and most of the browsing and online clustering then uses the cluster descriptions, instead of the contained documents. To the best of our knowledge, Scatter/Gather has not previously been generalised to the highly distributed environment of peer-to-peer networks. Content-based clustering as such, on the other hand, has already been employed in peer-to-peer networks as well. In Eisenhardt, Mu¨ller, and Henrich (2003), a distributed version of the k-means algorithm is performed in a peer-to-peer network, profiting from the parallelisation thus made possible. While the documents do not have to be transferred to one single peer for clustering, cluster centres are propagated to all participating peers in each clustering step. The approach does not deal with peer dynamics. Klampanos and Jose (2004) apply clustering on two levels. The documents on each peer are clustered in order to find the topics present on the peer. On joining the network, the peers themselves are assigned into context-aware groups (CAGs), which form an overlapping clustering on the basis of the topics on the peers. The centroids of the resulting CAGs are then used for query routing. Both the clustering of peers into CAGs, and the query routing are tasks performed by hubs which have to maintain a list of all CAG centroids. Both approaches use a local clustering of documents as well as a global list of cluster centroids. Yet, neither one addresses browsing. In Sections 3 and 5, we followed similar principles to create a distributed, three-tiered cluster hierarchy for Scatter/Gather browsing. 7.4. Architectures Service-oriented architectures (SOA) become more and more popular. Services for managing knowledge bases, planning purposes and semantic registries are provided in a hierarchical peer-to-peer network in Gioldasis, Pappas, Kazasis, Anestis, and Christodoulakis (2004). Hing and Sølvberg (2004) combine SOA digital libraries and the peer-to-peer network software JXTA. As Pepper, they use distributed hash tables, but employ Hilbert Space Filling Curves instead of Chord, and the DHT stores service descriptions. Content-based resource selection is not supported. Furmento, Hau, Lee, Newhouse, and Darlington (2003) present a general service-oriented architecture where the same service implementation can be used on top of the different transport layers, including Jini, JXTA and the Open Grid Service Infrastructure (OGSI). They claim that the major
640
H. Nottelmann, G. Fischer / Information Processing and Management 43 (2007) 624–642
problems with JXTA are lack of security and – more important for digital libraries – performance, and recommend the usage of Jini.
8. Conclusion and outlook This paper describes the Pepper system, a peer-to-peer system for information retrieval in federated digital libraries. The flexible Pepper architecture seamlessly combines a service-oriented architecture with hierarchical peer-to-peer networks. The current implementation uses JXTA-SOAP as its basis, but this middleware could be replaced by others. On top of this architecture, several services provide two information access modes, searching and browsing. Searching and browsing employ the same statistical data about the digital libraries. Hence, a common service collects and stores these statistical descriptions of collections, and can be utilised for both access modes. This shows one of the major advantages of our approach: Services can be reused for different aspects of the search process, namely retrieval vs. browsing. Another advantage of our architecture is that it is particularly easy to replace a service implementation by a new, better one as long as the service interface remains unchanged. For example, retrieval in hubs is split into different services like the resource selection service, the result merging service and a common hub search service which calls the other services. Thus, a new resource selection approach can be easily integrated by replacing that single service, keeping, e.g., the result merging service and the hub search service unchanged. Future work in the Pepper project should aim at improving the single components. For resource selection, we have considered creating better hub descriptions. Hub descriptions should not only take direct DL neighbours into account, but should rather represent a network region. In addition, time costs (i.e., the number of hops) should be estimated in peer-to-peer networks. The browsing process will be further decentralised, and the application of other similarity measures and clustering algorithms will be investigated. In addition, the two access modes could be coupled more tightly. In an advanced scenario, dynamic cost-based selection could be applied not only to DLs during search, but also to the services themselves which are required for solving an arbitrary user task. A logic-based approach to this problem is proposed in Nottelmann and Fuhr (2004). Here, services are semantically described using OWL-S,7 mainly by defining input and output parameters. Similar to resource selection, services are annotated with their costs (time, money, quality). The service descriptions can be stored in the registration service as well. Then, the selection services (located in the hubs) apply recursive match-making rules for finding suitable services and their execution order, and use the cost estimations for computing an optimum selection (similar to resource selection).
9. Note Henrik Nottelmann passed away in April 2006, before we could submit a revised version of this paper. Although the Pepper project will be continued by others, it will take a considerable amount of time and effort for any successor to resume Henrik’s research. We, his colleagues in Duisburg, hope to continue his work after a necessary intermission, in his memory.
Acknowledgements Part of this work is supported in part by the DFG (project ‘‘Pepper’’, grant BIB47 DOuv 02-01). We wish to thank Alexej Titarenko and Andre´ Nurzenski for their help with the implementation and the concepts used in Pepper. 7
http://www.daml.org/services/owl-s/1.0/.
H. Nottelmann, G. Fischer / Information Processing and Management 43 (2007) 624–642
641
References Aberer, K., Cudre-Mauroux, P., Hauswirth, M., & van Pelt, T. (2004). GridVine: building internet-scale semantic overlay networks. In International semantic web conference (ISWC), Lecture Notes on Computer Science (pp. 107–121), Vol. 3298. Bender, M., Michel, S., Triantafillou, P., Weikum, G., & Zimmer, C. (2005). MINERVA: collaborative P2P search. In VLDB ’05: proceedings of the 31st international conference on very large data bases (pp. 1263–1266). VLDB Endowment. Bender, M., Michel, S., Zimmer, C., & Weikum, G. Bookmark-driven query routing in peer-to-peer web search. In Callan, Fuhr, and Nejdl (2004). Callan, J., & Connell, M. (2001). Query-based sampling of text databases. ACM Transactions on Information Systems, 19(2), 97–130. Callan, J., Fuhr, N., & Nejdl, W. (Eds.) (2004). SIGIR workshop on peer-to-peer information retrieval. Callan, J. P., Lu, Z., & Croft, W. B. (1995). Searching distributed collections with inference networks. In E. A. Fox, P. Ingwersen, & R. Fidel (Eds.), Proceedings of the 18th annual international ACM SIGIR conference on research and development in information retrieval (pp. 21–29). New York: ACM. Cuenca-Acuna, F. M., Peery, C., Martin, R. P., & Nguyen, T. D. (2002). PlanetP: a content addressable peer-to-peer data store. Technical Report DCS-TR-484, Department of Computer Science, Rutgers University, Piscataway. Cutting, D. R., Karger, D. R., & Pedersen, J. O. (1993). Constant interaction-time scatter/gather browsing of very large document collections. In Proceedings of the sixteenth annual international ACM SIGIR conference on research and development in information retrieval (pp. 126–134). New York: ACM. Cutting, D. R., Pedersen, J. O., Karger, D., & Tukey, J. W. (1992). Scatter/gather: a cluster-based approach to browsing large document collections. In Proceedings of the fifteenth annual international ACM SIGIR conference on research and development in information retrieval (pp. 318–329). New York: ACM. Ding, H., & Sølvberg, I. (2004). Exploiting extended service-oriented architecture for federated digital libraries. In The 7th international conference of asian digital libraries. Dublin Core Metadata Initiative. Dublin Core Metadata Element Set, Version 1.1. Technical Report 1.1, Dublin Core Metadata Initiative, July 1999.
. Eisenhardt, M., Mu¨ller, W., & Henrich, A. (2003). Classifying documents by distributed P2P clustering. In K. R. Dittrich, W. Ko¨nig, A. Oberweis, K. Rannenberg, & W. Wahlster, (Eds.), Proceedings Informatik 2003, GI (pp. 286–291). Fischer, G., & Nurzenski, A. (2005). Towards scatter/gather browsing in a hierarchical peer-to-peer network. In H. Nottelmann, K. Aberer, J. Callan, & W. Nejdl, (Eds.), Proceedings of the 2005 ACM workshop on information retrieval in peer-to-peer networks (P2PIR 2005), Bremen, Germany, November 4, 2005. Fuhr, N. (1999). A decision-theoretic approach to database selection in networked IR. ACM Transactions on Information Systems, 17(3), 229–249. Fuhr, N. (2000). Probabilistic datalog: implementing logical information retrieval for advanced applications. Journal of the American Society for Information Science, 51(2), 95–110. Furmento, N., Hau, J., Lee, W., Newhouse, S., & Darlington, J. (2003). Implementations of a service-oriented architecture on top of Jini, JXTA and OGSI. In Proceedings of UK e-science all hands meeting. Gioldasis, N., Pappas, N., Kazasis, F., Anestis, G., & Christodoulakis, S. (2004). A P2P and SOA infrastructure for distributed ontologybased knowledge management. In M. Agosti, H.-J. Schek, & C. Tnrker, (Eds.), Digital library architectures: peer-to-peer, grid, and service-orientation, sixth thematic workshop of the EU network of excellence DELOS on digital library architectures. Gong, L. (2001). JXTA: a network programming environment. IEEE Internet Computing, 5, 88–95. Gravano, L., Garcia-Molina, H., & Tomasic, A. (1994). The effectiveness of GlOSS for the text database discovery problem. In R. T. Snodgrass & M. Winslett (Eds.), Proceedings of the 1994 ACM SIGMOD. International conference on management of data (pp. 126–137). New York: ACM. Harren, M., Hellerstein, J. M., Huebsch, R., Loo, B. T. L., Shenker, S., & Stoica, I. (2002). Complex queries in DHT-based peer-to-peer networks. In Electronic proceedings for the 1st international workshop on peer-to-peer systems (IPTPS). Idreos, S., Koubarakis, M., & Tryfonopoulos, C. (2004). P2P-DIET: an extensible P2P service that unifies ad-hoc and continuous querying in super-peer networks. In Proceedings of the 2004 ACM SIGMOD international conference on management of data (pp. 933–934). New York, NY, USA: ACM Press. Klampanos, I. A., & Jose, J. M. (2004). An architecture for information retrieval over semi-collaborating peer-to-peer networks. In Proceedings of the 2004 ACM symposium on applied computing (pp. 1078–1083). New York, NY, USA: ACM Press. Lu, J., & Callan, J. Pruning long documents for distributed information retrieval. In Nicholas et al. (2002) (pp. 332–339). Lu, J., & Callan, J. (2003). Content-based retrieval in hybrid peer-to-peer networks. In D. Kraft, O. Frieder, J. Hammer, S. Qureshi, & L. Seligman (Eds.), Proceedings of the 12th international conference on information and knowledge management. New York: ACM. Lu, J., & Callan, J. Federated search of text-based digital libraries in hierarchical peer-to-peer networks. In Callan et al. (2004). Mu¨ller, W., Eisenhardt, M., & Henrich, A. (2005). Scalable summary based retrieval in p2p networks. In CIKM ’05: proceedings of the 14th ACM international conference on information and knowledge management (pp. 586–593). New York, NY, USA: ACM Press. Nejdl, W., Wolpers, M., Siberski, W., Schmitz, C., Schlosser, M., Brunkhorst, I., et al. (2003). Super-peer-based routing and clustering strategies for RDF-based peer-to-peer networks. In The twelfth international world wide web conference (WWW) (pp. 536–543). Nicholas, C., Grossman, D., Kalpakis, K., Qureshi, S., van Dissel, H., & Seligman, L. (2002). In Proceedings of the 11th international conference on information and knowledge management. New York: ACM.
642
H. Nottelmann, G. Fischer / Information Processing and Management 43 (2007) 624–642
Nottelmann, H., & Fuhr, N. (2003). Evaluating different methods of estimating retrieval quality for resource selection. In J. Callan, G. Cormack, C. Clarke, D. Hawking, & A. Smeaton (Eds.), Proceedings of the 26st annual international ACM SIGIR conference on research and development in information retrieval. New York: ACM. Nottelmann, H., Fuhr, N. A logic-based approach for computing service executions plans in peer-to-peer networks. In Callan et al. (2004). Nottelmann, H., & Fuhr, N. (2006). Comparing different architectures for query routing in peer-to-peer networks. In 28th European conference on information retrieval research (ECIR 2006). Nottelmann, H., & Straccia, U. (2005) sPLMap: a probabilistic approach to schema matching. In D. E. Losada & J. M. F. Luna, (Eds.), 27th European Conference on Information Retrieval Research (ECIR 2005). Ritter, J. (2001). Why Gnutella can’t scale. No, really. Robertson, S. E. (1977). The probability ranking principle in IR. Journal of Documentation, 33, 294–304. Robertson, S. E., Walker, S., Hancock-Beaulieu, M., Gull, A., & Lau, M. (1992). Okapi at TREC. In Text REtrieval Conference (pp. 21– 30). Saroiu, S., Gummadi, P. K., & Gribble, S. (2002). A measurement study of peer-to-peer file sharing systems. In SPIE multimedia computing and networking (MMCN2002). Schlosser, M., Sintek, M., Decker, S., & Nejdl, W. (2005). Digital libraries. In 1st Workshop on agents and P2P computing. Si, L., Jin, R., Callan, J., & Ogilvie, P. Language modeling framework for resource selection and results merging. In Nicholas et al. (2002). Stoica, I., Morris, R., Karger, D., Kaashoek, F., & Balakrishnan, H. (2001). Chord: a scalable peer-to-peer lookup service for internet applications. In ACM SIGCOMM. Suel, T., Mathur, C., Wu, J., Zhang, J., Delis, A., Kharrazi, et al. (2003). Odissea: a peer-to-peer architecture for scalable web search and information retrieval. Technical Report TR-CIS-2003-01, Polytechnic Univ. Brooklyn, NY, USA. van Rijsbergen, C. J. (1986). A non-classical logic for information retrieval. The Computer Journal, 29(6), 481–485. Wang, Y., Galanis, L., & DeWitt, D. J. (2003). Galanx: an efficient peer-to-peer search engine system. Technical report, University of Wisconsin, Madison, USA, 2003. Waterhouse, S. (2001). JXTA search: distributed search for distributed networks. Technical report, Sun Microsystems, Inc. Wong, S. K. M., & Yao, Y. Y. (1995). On modeling information retrieval with probabilistic inference. ACM Transactions on Information Systems, 13(1), 38–68.