Information Processing and Management 39 (2003) 307–322 www.elsevier.com/locate/infoproman
A similarity based relational algebra for Web and multimedia data
Danilo Montesi (a), Alberto Trombetta (b,*), Peter A. Dearnley (c)
(a) DCS, University of Bologna, Mura Anteo Zamboni, 7, Italy
(b) DSI, University of Milano, Dipartimento di Scienze dell'Informazione, Via Comelico 39, 20135 Milano, Italy
(c) School of Information Systems, University of East Anglia, Norwich, NR4 7TJ, UK
Abstract
Web and multimedia data are becoming very important. A fundamental characteristic of these data is imprecision. Query languages for Web and multimedia data must express imprecision in feature matching, similarity queries and user preferences. In addition, specific operators need to be introduced to organize the answers in a user-friendly style. The aim of this work is to provide a formal framework in which to formulate very powerful queries and presentations of the answers. To this end, a fuzzy based algebra is introduced. The fuzzy algebra extends the classical relational algebra over fuzzy relations with new operators. Both algebras allow user preferences in the form of weights to be attached to predicates and operators. The effect of these weights is to alter the classic behavior of query expressions to better suit user requirements. In addition, optimization issues are presented in the form of algorithms for the efficient evaluation of similarity based queries containing the new algebraic operators. © 2002 Elsevier Science Ltd. All rights reserved.
Keywords: Web query language; Multimedia query language; Query evaluation; Fuzzy sets
1. Introduction
The area of Web and multimedia queries has received considerable attention in the past (see for example Fagin, 1998; Florescu, Levy, & Mendelzon, 1998). Some studies dealing with imprecise queries have focused on specific features, such as fuzzy functional dependencies, or on a particular application domain (Zemankova-Leech & Kandel, 1984). Others,
This work has been partially supported by D2I MURST Project. * Corresponding author. E-mail addresses: [email protected] (D. Montesi), [email protected] (A. Trombetta), [email protected] (P.A. Dearnley).
0306-4573/03/$ - see front matter 2002 Elsevier Science Ltd. All rights reserved. PII: S 0 3 0 6 - 4 5 7 3 ( 0 2 ) 0 0 0 5 4 - 7
consider specific issues such as similarity queries (Agrawal, Faloutsos, & Swami, 1993) or efficient access methods (Ciaccia, Patella, & Zezula, 1997). What is lacking is a general approach to modeling data and queries that deals with the imprecise information found in many real-world objects, such as text, images and video, while still allowing efficient query evaluation. The aim of this paper is to extend the relational data model and algebra, using fuzzy set theory, so that they faithfully capture this imprecise information. In addition, we present two algorithms for the efficient evaluation of complex queries that occur very often in our setting. Although we are aware of semistructured and object-oriented approaches to modeling Web and multimedia data (Buneman, 1997; Goebel et al., 1999), we focus here on the relational data model and algebra, since we intend to show how imprecision can be defined in a well-known setting. We define a notion of similarity among the representations of such objects, and we extend the algebra with new predicates and operators well suited to expressing powerful similarity-based queries and to presenting the retrieved items in a useful way. Similarity and imprecision are intrinsic features of the items represented in our data model. For this reason, we have based our representations of similarity and imprecision on simple concepts from fuzzy set theory rather than on probability theory, which is better suited to representing the uncertainty entailed by our limited knowledge of the real world when we model it by formal means. We refer to Crestani (1998) for a survey of probability-based methods for the representation of uncertainty in Information Retrieval. The extensions we provide to the classical relational model are concerned with how to represent and properly use such imprecision.
We address the issues of imprecision in the relational model on three different levels. First, imprecision occurs in our extended relational model in the values of attributes: such values are elements of a domain with a fuzzy membership degree, computed at data insertion time. We call this notion value imprecision. Relations are thus sets of tuples formed by values and the corresponding fuzzy value memberships. Second, imprecision occurs through the weighting values and similarity predicates provided at query time. We call this notion query imprecision, to stress that it is under the user's control. Finally, imprecision occurs at the tuple level. We call this notion tuple imprecision. The tuple membership degree is computed with a scoring function that takes into account a weighted combination of value imprecisions. As scoring function we use the function $f_\Theta$ introduced in Fagin and Wimmers (1997). We then introduce a fuzzy relational algebra for querying fuzzy relations. This algebra is an extension of the classical relational algebra: we extend the usual operations of selection, projection, join, union and difference in order to deal with relations over fuzzy values. Clearly, these definitions have to take into account the new features brought by pairing non-classical truth degrees with attributes' values. New algebraic operators are introduced in order to fully exploit the expressiveness of the fuzzy relational algebra. We employ the fuzzy relational algebra to faithfully represent the information stored in various search engines and to express powerful queries over them. Since search engines can be very expensive resources (in terms of both the time and the space needed to obtain an answer), we use equivalence and containment rules holding among algebraic expressions to rewrite queries into less expensive ones yielding the same results.
In particular, we show how to rewrite two very useful kinds of complex queries expressible in our fuzzy relational algebra, and we present two algorithms for their efficient evaluation, based on the $A_0$ algorithm of Fagin (1999), that take into account the fact that search engines offer very limited access methods to the information stored in them.
The remainder of this work is structured as follows. In Section 2 we present some related work. In Section 3 we give an overview of our approach through Web and multimedia examples. Section 4 introduces our fuzzy relational algebra incrementally, and in Section 5.2 the query evaluation algorithms are presented. Finally, in Section 6 we draw some conclusions.
2. Related works
In the literature, many fuzzy relational algebras have been proposed. Most of them are derived from the model proposed by Buckles and Petry (1982). In this model, the imprecision associated with fuzzy data is described using attributes taking values over linguistic terms (such as High, Low or Good, Moderate, Bad). A notion of similarity among these linguistic terms is developed in such a way that it is possible to express statements like "Moderate is 0.5-similar to Bad". No notions of tuple imprecision or query weight are introduced. Aside from the usual queries expressible in the classical relational algebra, it is possible to select tuples whose attributes are similar above a given threshold. In this work no attention is paid to equivalence properties among expressions of the algebra that would support algebraic query processing. Rather, work in the field of similarity-based retrieval has focused on implementation issues such as the description of index structures for storing and retrieving multimedia objects (Faloutsos, 1996). Only recently have there been proposals for theoretical frameworks in which to define notions of similarity and the corresponding query languages. In Jagadish, Mendelzon, and Milo (1995), a very general framework independent of any particular definition of similarity is presented. In Adali, Bonatti, Sapino, and Subrahmanian (1998), the authors present a multi-similarity algebra in which it is possible to express queries asking for the top objects similar to a given one and asking for the similarity degree of given objects. The measure of imprecision, over which the similarity among objects is computed, is associated with objects' attributes. This is done independently of a fixed definition of similarity relation among objects. The relational algebra is extended with two operators that answer these two kinds of queries. Equivalence rules for the algebra, along with their cost models, are presented.
In Ciaccia (1998), a fuzzy relational algebra for multimedia environments is presented. Though the aim of the work is similar to the previous one, the methods used are different. The integration of different information sources is done using the weighted scoring function $f_\Theta$ of Fagin, defined in Fagin and Wimmers (1997). Such a function allows the user to compute an overall tuple imprecision using a weighted combination of the corresponding value imprecisions. The work focuses on the development of indexing methods for the algebra previously presented. The standard fuzzy interpretation of the connectives is used to deal with the evaluation of boolean combinations of atomic queries involving fuzzy sets. The approach is independent of the data model and is focused on evaluation issues such as the complexity of merging ordered answers coming from different sources in a consistent manner.
3. Motivating examples
Our examples describe settings in which there are different information sources, such as search engines, providing (classically speaking) inconsistent information about Web sites or in which the
stored data have an intrinsic degree of vagueness, such as a database containing information about hand-written characters. Our approach allows the representation of such inconsistencies and vagueness and, further, exploits them to query the data at a finer level than the classical one. Namely, it is very easy to tell apart tuples having the same attribute values but different value imprecisions. We assume that three fuzzy relations AltaVista, Google and HotBot are given. Each of these relations contains data concerning the Web pages indexed by the respective search engine. We assume also that data about imprecision are normalized, in the sense that all the membership degrees of the attributes' values are expressed as real numbers in the interval $(0, 1]$. The data relate to imprecise notions such as the topics the Web pages are about, how much they are referenced in other Web pages, how much they are connected to other Web pages, the number of links contained in the pages and the number of links pointing to them. The schemas of the fuzzy relations involved are: AltaVistaImages(Url, Topic:$\mu$, Update, Referenced), HotBot(Url, Topic:$\mu$, InDegree, OutDegree), Google(Url, Topic:$\mu$, ConnectionDegree). We write Topic:$\mu$ meaning that the values of the attribute Topic can belong to their domain with a membership degree $\mu$ between 0 and 1. In this way, every tuple of such fuzzy relations represents the information that, with respect to a given search engine, a Web page is about a specified topic with score given by $\mu$. In the following, we show detailed examples of queries over the fuzzy relations presented above. They concern non-classical features expressible within our framework, such as weighted queries and similarity based queries, and are given along with the corresponding algebra expressions.
Similarity selection query: A relevant extension is the introduction, in the selection operator, of conditions that choose tuples according to a specified similarity predicate, denoted with $\sim$. Consider the following example, which selects tuples according to the degree to which a fixed attribute is similar to a fixed value: "Find all the pages indexed by HotBot having topic similar to 'paintings'". This operation cannot be considered a classical selection, since it involves a non-classical feature, a similarity predicate, which can be satisfied with any truth degree between 0 and 1:
$$\sigma_{topic \sim \text{``paintings''}}(HotBot).$$
Weighted join query: An important feature of our extended relational algebra is the possibility of expressing how much each argument of an operator contributes to the tuple membership degree of the tuples contained in the answer. As an example of a weighted query, we consider the following one, involving a join: "Find all the pages in AltaVista and HotBot having the same topic. The pages from Google are twice as relevant as the pages coming from HotBot":
$$AltaVista^{2/3} \bowtie_{Google.topic = HotBot.topic} HotBot^{1/3}.$$
Top query: Our extended relational algebra provides new operators exploiting features of our extended relational model, such as the tuple membership degree. In particular, we introduce operators for limiting the answer's cardinality by retrieving only the tuples having a high tuple membership degree. Here we show a query involving the Top operator, whose effect is to retrieve only the first k tuples satisfying the selection condition (k is a specified natural number and the tuples are retrieved in descending order of their tuple membership degrees): "Retrieve the first seven pages from Google having topic similar to 'renaissance'":
$$\tau^{7}_{topic \sim \text{``renaissance''}}(Google).$$
4. Fuzzy relational algebra
We present the extended data model allowing the representation of imprecision at the attribute level as well as at the tuple level. We redefine the operators of the classical relational algebra in order to take into account weighted and similarity based queries. Then, we introduce the new operators Top and Cut which, respectively, limit the answer's cardinality and discard those tuples with tuple membership degree lower than a specified threshold. We define the data model and the algebra operators to represent and perform searches over imprecise data. Recall that a fuzzy domain $D$ is a set with its characteristic function $\chi_D$ taking values over the unit real interval $(0, 1]$ (Zadeh, 1965). Given a fuzzy domain $D$ and a constant $c$, we write $c \in D : \mu$ to state that $c$ belongs to $D$ with membership degree (or, equivalently, score) $\mu$, with $0 < \mu \le 1$. A fuzzy tuple $\langle c_1 : \mu_1, \ldots, c_n : \mu_n \rangle$ is an extension of the classical definition, in which attributes can take values over fuzzy domains. The definition of fuzzy relational database is analogous to that of a classical relational database. More precisely, a fuzzy database schema $r$ is a set of fuzzy relation schemas, $r = \{r_1, \ldots, r_n\}$. A fuzzy database $R$ is a set of fuzzy relations $R = \{R_1, \ldots, R_n\}$, where each $R_i$ is a set of fuzzy tuples $t_1, \ldots, t_u$, instance of the corresponding schema $r_i$. The values of the attributes $A_1, \ldots, A_n$ range over fuzzy domains $D_1, \ldots, D_n$. We recall the notation used to indicate the attribute membership degrees of the tuples in a fuzzy relation: $\langle c_1 : \mu_1, \ldots, c_n : \mu_n \rangle \in R$ is equivalent to $R(c_1 : \mu_1, \ldots, c_n : \mu_n)$. Operators of the fuzzy relational algebra allow us to build arbitrarily complex expressions that represent complex queries. Aside from the usual boolean predicates (like $=$ and $\le$, for example), complex selection conditions are built also from similarity predicates.
A similarity predicate has the form $A \sim v$, where $A$ is an attribute, $\sim$ is a similarity operator and $v$ a value of a proper domain, or $A \sim B$, where $A$ and $B$ are attributes. The evaluation of a similarity predicate returns a real number between 0 and 1, denoting how similar the value of attribute $A$ is to the value $v$ (or to the value of $B$). With a non-Boolean semantics, it is quite natural and useful to give the user the possibility of assigning a different relevance to each of the conditions stated to retrieve tuples. Such "user preferences" can be expressed by means of weights, thus saying, for instance, that the score of a predicate on Color is twice as important as the score of a predicate on the Texture of an image. The seminal work by Fagin and Wimmers (1997) shows how any function $s_f$ (scoring function) evaluating the selection condition $f(p_1, \ldots, p_n)$, satisfying some generic properties, can be properly extended into a weighted version, $s_{f_\Theta}$, where $\Theta = [\theta_1, \ldots, \theta_n]$ is a vector of weights (a "weighting"), in such a way that:
1. $s_{f_\Theta}$ reduces to $s_f$ when all the weights are equal;
2. $s_{f_\Theta}$ does not depend on $s(p_i, t)$ when $\theta_i = 0$;
3. $s_{f_\Theta}$ is a continuous function of the weights.
Let $s_i = s(p_i, t)$ denote the score of $t$ with respect to the predicate $p_i$, and assume without loss of generality $\theta_1 \ge \theta_2 \ge \cdots \ge \theta_n$, with $\theta_i \in [0, 1]$ and $\sum_i \theta_i = 1$. Then, Fagin and Wimmers' formula is:
$$s_{f_\Theta}(s_1, \ldots, s_n) = (\theta_1 - \theta_2)\, s_1 + 2(\theta_2 - \theta_3)\, s_f(s_1, s_2) + 3(\theta_3 - \theta_4)\, s_f(s_1, s_2, s_3) + \cdots + n\,\theta_n\, s_f(s_1, s_2, \ldots, s_n). \quad (1)$$
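The weighting formula (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `weighted_score` is ours, the unweighted scoring function defaults to `min` (the standard fuzzy conjunction, which the paper leaves as a parameter), and $\theta_{n+1}$ is taken to be 0 so that the last term becomes $n\,\theta_n\, s_f(s_1, \ldots, s_n)$.

```python
from typing import Callable, Sequence


def weighted_score(
    scores: Sequence[float],
    weights: Sequence[float],
    base: Callable[[Sequence[float]], float] = min,
) -> float:
    """Weighted extension of `base` following formula (1).

    `base` is the unweighted scoring function s_f; `min` is an assumption
    made for this sketch, not a choice fixed by the paper.
    """
    if len(scores) != len(weights) or not scores:
        raise ValueError("scores and weights must be non-empty and of equal length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    # Formula (1) assumes theta_1 >= ... >= theta_n, so reorder the pairs.
    pairs = sorted(zip(weights, scores), key=lambda p: p[0], reverse=True)
    thetas = [w for w, _ in pairs] + [0.0]  # theta_{n+1} = 0 closes the sum
    ordered = [s for _, s in pairs]
    total = 0.0
    for i in range(1, len(ordered) + 1):
        # i-th term: i * (theta_i - theta_{i+1}) * s_f(s_1, ..., s_i)
        total += i * (thetas[i - 1] - thetas[i]) * base(ordered[:i])
    return total
```

With weights $[2/3, 1/3]$ and scores $[0.9, 0.6]$ the sketch gives $(1/3)(0.9) + 2(1/3)(0.6) = 0.7$. With equal weights it falls back to plain `min`, and a predicate with weight 0 has no influence on the result, matching properties 1 and 2 above.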
Although the above formula is usually used to weigh the predicates appearing in a (selection) formula, we extend most of the fuzzy algebra operators, allowing them to incorporate weights. In the following we present the basic algebraic operators selection, projection, join, union and difference.
Selection: The Selection operator applies a formula $f$ to the tuples in the result of the expression $e$ and filters out those which do not satisfy $f$. The novel point here is that, as an effect of $f$ and of weights, the grade of a tuple $t$ can change. Weights can be used for two complementary needs. In the first case, they weigh the importance of predicates in $f$, as in Fagin and Wimmers (1997), thus leading to the use of the scoring function $s_{f_{\Theta_f}}$ in place of $s_f$. In the second case they are used to perform a weighted conjunction, $\wedge^{\Theta}$, between the score computed by $f$ and the "input" tuple score, $t.\mu_e$. This determines the new tuple score, $t.\mu$:
$$\sigma^{\Theta}_f(e) = \{ t \mid t \in e \wedge t.\mu = s^{\Theta}_{\wedge}(s(f, t), t.\mu_e) > 0 \}. \quad (2)$$
Projection: As in the classic relational algebra, the Projection operator removes a set of attributes and then eliminates duplicate tuples. Projection can also be used to discard scores, both of fuzzy attributes and of the whole tuple. In this case, however, in order to guarantee consistency of subsequent operations, such scores are simply set to 1, so that they can still be referenced in the resulting schema. This captures the intuition that if we discard, say, the tuples' scores, then the result is a crisp relation, that is, a fuzzy relation whose tuples all have score 1. Formally, let $e$ be a relation with schema $E(X)$, $Y \subseteq X$, and $V$ a set of $\chi$-annotated fuzzy attributes, $V = \{A_i^{\chi}\}$, where $V$ contains exactly those fuzzy attributes for which scores are to be discarded. Note that $V$ can include $A_i^{\chi}$ only if $A_i \in X - Y$. Finally, let $F$ stand for either $\mu$ or the empty set. Then, the projection of $e$ over $YVF$ is a relation with schema $YW$, where if $A_i^{\chi} \in V$ then $A_i \in W$, defined as follows:
$$\pi_{YVF}(e) = \{ t[YW] \mid \exists t' \in e : t[YV] = t'[YV] \wedge \forall A_i^{\chi} \in V : t.A_i^{\mu} = 1 \wedge t.\mu = s_{\vee}\{ t''.\mu_E \mid t''[YV] = t[YV] \} \text{ if } F = \mu, \text{ otherwise } t.\mu = 1 \}. \quad (3)$$
Thus, tuples' scores are discarded (i.e. set to 1) when $F = \emptyset$, whereas they are preserved when $F = \mu$. In the latter case, new scores are computed by considering the "parametric disjunction", $s_{\vee}$, of the scores of all duplicate tuples with the same values for $YV$.
Join: In the fuzzy relational algebra the weighted (natural) Join is an n-ary operator$^{1}$ which, given $n$ relations $e_i$ with schemas $E_i(X_i)$, computes the score of a tuple $t$ as a weighted conjunction, $\wedge^{\Theta}(t_1.\mu_{E_1}, \ldots, t_n.\mu_{E_n})$, with $\Theta = [\theta_1, \ldots, \theta_n]$, of the scores of matching tuples. The introduction of weights in the definition makes the join operator non-associative, which makes an operator with $n$ arguments necessary. The definition of the n-ary weighted Join operator is:
$$\bowtie^{\Theta}(e_1, \ldots, e_n) = \{ t[X_1 \ldots X_n] \mid \exists t_1 \in e_1, \ldots, \exists t_n \in e_n : t[X_1] = t_1[X_1] \wedge \cdots \wedge t[X_n] = t_n[X_n] \wedge t.\mu = s^{\Theta}_{\wedge}(t_1.\mu_{E_1}, \ldots, t_n.\mu_{E_n}) > 0 \}. \quad (4)$$
Union: The Union is also an n-ary operator, where the score of a result tuple $t$ is a weighted disjunction, $s^{\Theta}_{\vee}(t.\mu_{E_1}, \ldots, t.\mu_{E_n})$, of the input tuples' scores:
$$\cup^{\Theta}(e_1, \ldots, e_n) = \{ t \mid (t \in e_1 \vee \cdots \vee t \in e_n) \wedge t.\mu = s^{\Theta}_{\vee}(t.\mu_{E_1}, \ldots, t.\mu_{E_n}) > 0 \}. \quad (5)$$
1 We also use the infix notation, $E_1 \bowtie^{[\theta_1, \theta_2]} E_2$, when only two operands are present.
Note that, because of the presence of weights, Union is no longer associative, as already noted for the weighted Join operator. This implies that the n-ary Union cannot be defined in terms of $n - 1$ binary unions, as happens in the classical relational algebra.
Difference: Given relations $e_1$ and $e_2$ with schemas $E_1(X)$ and $E_2(X)$, respectively, their Difference is defined as:
$$e_1 - e_2 = \{ t \mid t \in e_1 \wedge t.\mu = s_{\wedge}(t.\mu_{E_1}, s_{\neg}(t.\mu_{E_2})) > 0 \}. \quad (6)$$
Top: The Top operator retrieves the first $k$ tuples of a relation $e$ ($k$ is an input parameter), according to a ranking criterion expressed by a ranking function $g$. If weights are used to rank tuples according to $g_{\Theta_g}$, then $g$ has to be a formula of predicates over the schema of $e$.$^{2}$ If $e$ has no more than $k$ tuples, then $\tau^{k}_{g_{\Theta_g}}(e) = e$, otherwise:
$$\tau^{k}_{g_{\Theta_g}}(e) = \{ t \mid t \in e \wedge s(g_{\Theta_g}, t) > 0 \wedge |\tau^{k}_{g_{\Theta_g}}(e)| = k \wedge \forall t \in \tau^{k}_{g_{\Theta_g}}(e) : \nexists t' : t' \in e \wedge t' \notin \tau^{k}_{g_{\Theta_g}}(e) \wedge g_{\Theta_g}(t') > g_{\Theta_g}(t) \} \quad (7)$$
with ties arbitrarily broken. When $g$ is omitted, the default ranking criterion, based on the score of tuples, applies, so the $k$ tuples with the highest scores are returned.
Cut: The Cut operator "cuts off" those tuples which do not satisfy a formula $g$, that is:
$$\gamma_g(e) = \{ t \mid t \in e \wedge s(g, t) > 0 \wedge t.\mu = t.\mu_E > 0 \}. \quad (8)$$
Unlike Selection, Cut does not change tuples' scores. Thus, if $g$ includes non-Boolean predicates, the two operators behave differently. However, the major reason to introduce Cut is the need to express (threshold) conditions on tuples' scores, e.g. $\mu > 0.6$. Such a predicate cannot be part of a Selection, since it does not commute with other predicates. In other words, the expressions $\gamma_{\mu > 0.6}(\sigma_f(E))$ and $\sigma_f(\gamma_{\mu > 0.6}(E))$ are not equivalent. Indeed, the first expression is contained in the second one (Ciaccia, Montesi, Penzo, & Trombetta, 2000):
$$\gamma_{\mu > \alpha}(\sigma_f(E)) \subseteq \sigma_f(\gamma_{\mu > \alpha}(E)). \quad (9)$$
In the following, we write $\gamma^{\alpha}_f(E)$ in place of $\gamma_{\mu \ge \alpha}(\sigma_f(E))$.
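The containment (9) can be checked on a toy relation. The sketch below makes several assumptions that are ours and not the paper's: a fuzzy relation is a Python dict mapping a tuple identifier to its score, the conjunction used by Selection is unweighted `min`, and the helper names `select` and `cut` are hypothetical.

```python
from typing import Callable, Dict

Relation = Dict[str, float]  # tuple identifier -> tuple score (assumed encoding)


def select(rel: Relation, pred: Callable[[str], float]) -> Relation:
    """Selection: the new score is min(predicate score, input score); keep if > 0."""
    out = {}
    for t, mu in rel.items():
        new_mu = min(pred(t), mu)
        if new_mu > 0:
            out[t] = new_mu
    return out


def cut(rel: Relation, alpha: float) -> Relation:
    """Cut with the threshold condition mu > alpha; scores are left unchanged."""
    return {t: mu for t, mu in rel.items() if mu > alpha}


# Toy data: input scores and similarity scores of a predicate f (both invented).
E = {"a": 0.9, "b": 0.8, "c": 0.5}
similarity = {"a": 0.7, "b": 0.4, "c": 0.9}


def f(t: str) -> float:
    return similarity[t]


cut_after_select = cut(select(E, f), 0.6)   # only "a" survives
select_after_cut = select(cut(E, 0.6), f)   # "a" and "b" survive
```

Tuple "b" shows why the two expressions differ: its input score 0.8 passes the threshold, but the score 0.4 it receives from the selection does not. Every tuple of the first result also appears in the second, as (9) states.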
5. Query processing
The fuzzy relational algebra provides a formal framework for the definition of a similarity based, SQL-like Web query language, called eWebSQL (Trombetta, 2001), whose implementation is under way. Such a Web query language allows the specification of highly complex similarity based queries that access several search engines in the query evaluation phase. From a fuzzy relational algebra perspective, a search engine can be viewed as a fuzzy relation. More precisely, given a search engine SE, the fuzzy relation $SE$ having schema $SE(url, keyword : \mu)$ denotes the search engine SE. The Web page urls indexed by SE are stored in the attribute url, while keywords having non-null score with respect to Web pages are stored in the fuzzy attribute keyword, along with the corresponding score $\mu$.
2 If "bottom" tuples are needed, the ranking directive $<$ can be used, written $g_{\Theta_g, <}$.
In this way, eWebSQL queries are expressible as fuzzy relational algebra expressions. The query rewriting process is driven by several equivalence and containment rules holding for the fuzzy relational algebra, and by an appropriate cost model of the fuzzy relational algebra operator implementations. The equivalence and containment rules are discussed in greater detail in Trombetta (2001) and Ciaccia et al. (2000) and have led to interesting results. Even if the definition of a detailed cost model for the fuzzy relational algebra is outside the scope of this work, we present two algorithms for efficiently computing the answer of top-based and cut-based fuzzy relational algebra expressions. The algorithms are described in Section 5.2 and take into account the fact that the relations $SE_1, \ldots, SE_n$ denoting $n$ different search engines can be accessed only through the corresponding interfaces. This kind of access will be formalized using the notions of sorted and random access, introduced in Fagin (1999).
5.1. Equivalence and containment rules
The query optimizer is a very important component of a Database Management System (Ramakrishnan & Gehrke, 2000). Since query evaluation is usually a very expensive task in terms of space and time complexity, the aim of the query optimizer is to find, given a query $Q$, a query $Q'$ having lower complexity than $Q$ and yielding the same result. A common heuristic, followed by all the commercial DBMSs, is to find an equivalent query $Q'$ that minimizes the size of the intermediate results, thus minimizing the number of I/Os. This task is even more difficult in our setting, because the evaluation of a fuzzy relational algebra expression possibly involves search engine access and complex similarity predicates whose evaluation time can be comparable to that required by an I/O operation.
Since the query optimizer tries to minimize intermediate query result sizes, the expression $E_1 = \tau^{k}_f(R_1 \bowtie R_2)$ is more expensive than the expression $E_2 = \tau^{k}_f(\tau^{k}_f(R_1) \bowtie \tau^{k}_f(R_2))$. This is because the number of tuples to be joined in the second expression is reduced by applying the Top operator to both join operands. Thanks to the equivalence and containment rules for fuzzy relational algebra expressions, the query optimizer can safely rewrite $E_1$ into $E_2$, since the result of the former expression is contained in the result of the latter. We present here some of the fuzzy relational algebra equivalence rules useful for eWebSQL query optimization. For each rule, we give both the unweighted and the weighted version. We write $E_1 = E_2$ ($E_1 \subseteq E_2$) meaning that expression $E_1$ yields the same result as (resp. is contained in the result of) expression $E_2$. The rule, in its unweighted version, is:
$$\tau^{k}_f(R_1 \bowtie \cdots \bowtie R_n) \subseteq \tau^{k}_f(\tau^{k}_f(R_1) \bowtie \cdots \bowtie \tau^{k}_f(R_n)). \quad (10)$$
It states that the computation of the first $k$ tuples (according to tuple scores) satisfying the join $R_1 \bowtie \cdots \bowtie R_n$ can be accelerated by applying the Top operator to every subexpression $R_1, \ldots, R_n$, thus reducing the size of the join result. The previous rule holds when all the join weights are equal. It is quite remarkable that a similar rule holds also for unequal weights, as we will see also for other equivalence rules. The weighted version of Rule 10 is:
$$\tau^{k}_f(R_1^{\theta_1} \bowtie \cdots \bowtie R_n^{\theta_n}) \subseteq \tau^{k}_f(\tau^{k}_f(R_1)^{\theta_1} \bowtie \cdots \bowtie \tau^{k}_f(R_n)^{\theta_n}). \quad (11)$$
Unlike the case of the Top operator, where applying Top before the join may discard tuples from the result obtained by applying Top after the join, applying the Cut operator before the join safely reduces the number of tuples to be processed in order to obtain the final result:
$$\gamma^{\alpha}_f(R_1 \bowtie \cdots \bowtie R_n) = \gamma^{\alpha}_f(\gamma^{\alpha}_f(R_1) \bowtie \cdots \bowtie \gamma^{\alpha}_f(R_n)). \quad (12)$$
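Rule (12) can be illustrated on a small example. The following sketch rests on assumptions of ours: relations are dicts keyed by the join attribute (a url), the unweighted conjunction is `min`, the selection formula $f$ is left out so that Cut reduces to the threshold condition $\mu \ge \alpha$, and the names `join` and `cut` are hypothetical.

```python
from typing import Dict

Relation = Dict[str, float]  # join key (e.g. a url) -> tuple score (assumed encoding)


def join(*rels: Relation) -> Relation:
    """Unweighted natural join on the key; the score is the min of operand scores."""
    common = set(rels[0]).intersection(*rels[1:])
    return {t: min(r[t] for r in rels) for t in common}


def cut(rel: Relation, alpha: float) -> Relation:
    """Cut with the threshold condition mu >= alpha."""
    return {t: mu for t, mu in rel.items() if mu >= alpha}


# Invented scores for three pages in two relations.
R1 = {"p1": 0.9, "p2": 0.7, "p3": 0.3}
R2 = {"p1": 0.65, "p2": 0.5, "p3": 0.8}
alpha = 0.6

cut_late = cut(join(R1, R2), alpha)                        # cut after the join
cut_early = cut(join(cut(R1, alpha), cut(R2, alpha)), alpha)  # cut pushed below the join
```

Both plans return only `p1` with score 0.65, but the second one joins two tuples per operand instead of three. Under `min`, a joined tuple can reach the threshold only if each of its operand tuples does, which is why pushing Cut below the join loses nothing in this setting.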
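The weighted version of the same rule holds:
$$\gamma^{\alpha}_f(R_1^{\theta_1} \bowtie \cdots \bowtie R_n^{\theta_n}) = \gamma^{\alpha}_f(\gamma^{\alpha_1}_f(R_1)^{\theta_1} \bowtie \cdots \bowtie \gamma^{\alpha_n}_f(R_n)^{\theta_n}) \quad (13)$$
provided that the thresholds $\alpha_i$ are equal to
$$\alpha_i = \frac{\alpha - \sum_{j=1}^{i-1} j(\theta_j - \theta_{j+1})}{1 - \sum_{j=1}^{i-1} j(\theta_j - \theta_{j+1})}, \quad i \in [1, \ldots, n]. \quad (14)$$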
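The thresholds of formula (14) are easy to compute programmatically. The helper below is a sketch: the name `cut_thresholds` is ours, and it assumes strictly positive weights given in non-increasing order, which keeps every denominator away from zero.

```python
from typing import List, Sequence


def cut_thresholds(weights: Sequence[float], alpha: float) -> List[float]:
    """Per-operand thresholds alpha_1..alpha_n of formula (14).

    Assumes theta_1 >= ... >= theta_n > 0 (an assumption of this sketch).
    """
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be strictly positive")
    if any(a < b for a, b in zip(weights, weights[1:])):
        raise ValueError("weights must be in non-increasing order")
    thresholds = []
    partial = 0.0  # running value of sum_{j=1}^{i-1} j * (theta_j - theta_{j+1})
    for i in range(1, len(weights) + 1):
        thresholds.append((alpha - partial) / (1.0 - partial))
        if i < len(weights):
            partial += i * (weights[i - 1] - weights[i])
    return thresholds
```

For weights $[0.5, 0.3, 0.2]$ and $\alpha = 0.6$ the partial sums are 0, 0.2 and 0.4, so the thresholds are $0.6$, $0.4/0.8 = 0.5$ and $0.2/0.6 \approx 0.33$. For weights $[0.6, 0.4]$ and $\alpha = 0.7$ they are $0.7$ and $0.5/0.8 = 0.625$.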
For example, when $n = 3$, $[\theta_1, \theta_2, \theta_3] = [0.5, 0.3, 0.2]$ and $\alpha = 0.6$, we obtain $\alpha_1 = 0.6$, $\alpha_2 = 0.5$ and $\alpha_3 = 1/3 \approx 0.33$. Another useful equivalence rule is the following one, showing the interplay between Cut and Top operators:
$$\gamma^{\alpha}(\tau^{k}_g(E)) = \tau^{k}_g(\gamma^{\alpha}(E)). \quad (15)$$
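Thanks to this rule, the order of application of Top and Cut operators is irrelevant. Finally, we present equivalence rules showing the interactions between Cut (and Top) and Union operators:
$$\tau^{k}_f(E_1 \cup \cdots \cup E_n) = \tau^{k}_f(\tau^{k}_f(E_1) \cup \cdots \cup \tau^{k}_f(E_n)), \quad (16)$$
$$\tau^{k}_f(E_1^{\theta_1} \cup \cdots \cup E_n^{\theta_n}) = \tau^{k}_f(\tau^{k}_f(E_1)^{\theta_1} \cup \cdots \cup \tau^{k}_f(E_n)^{\theta_n}), \quad (17)$$
$$\gamma^{\alpha}_f(E_1 \cup \cdots \cup E_n) = \gamma^{\alpha}_f(\gamma^{\alpha}_f(E_1) \cup \cdots \cup \gamma^{\alpha}_f(E_n)), \quad (18)$$
$$\gamma^{\alpha}_f(E_1^{\theta_1} \cup \cdots \cup E_n^{\theta_n}) = \gamma^{\alpha}_f(\gamma^{\alpha}_f(E_1)^{\theta_1} \cup \cdots \cup \gamma^{\alpha}_f(E_n)^{\theta_n}). \quad (19)$$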
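As a quick sanity check of the unweighted Top/Union rule (16), the sketch below compares both sides on invented data. It assumes that relations are dicts from a tuple identifier to its score, that the unweighted disjunction is `max`, that tuples are ranked by score alone (the default criterion when the ranking function is omitted), and that no two scores tie. The names `top` and `union` are ours.

```python
from typing import Dict

Relation = Dict[str, float]  # tuple identifier -> tuple score (assumed encoding)


def top(rel: Relation, k: int) -> Relation:
    """Top: keep the k tuples with the highest scores (ties broken by identifier)."""
    ranked = sorted(rel.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:k])


def union(*rels: Relation) -> Relation:
    """Unweighted union: a tuple's score is the max of its scores in the operands."""
    out: Relation = {}
    for rel in rels:
        for t, mu in rel.items():
            out[t] = max(mu, out.get(t, 0.0))
    return out


E1 = {"u1": 0.9, "u2": 0.6, "u3": 0.2}
E2 = {"u2": 0.8, "u4": 0.7, "u5": 0.1}
k = 2

lhs = top(union(E1, E2), k)                  # Top applied after the union
rhs = top(union(top(E1, k), top(E2, k)), k)  # Top pushed below the union
```

Both sides return `u1` (0.9) and `u2` (0.8). The right-hand side unions only two tuples per operand, which is the saving the rule is meant to provide when each operand is an expensive search engine access.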
With these equivalence and containment rules at our disposal, we show how the query optimizer uses them to choose, among the equivalent query plans of a fuzzy relational algebra query, the one which minimizes the size of intermediate query results. Consider the expression:
$$\tau^{5}_{(color_{AltaVistaImages} \sim \text{``white''}) \wedge (content_{Google} \sim \text{``Raffaello''}) \wedge (start\_url = \text{``louvre''} \wedge end\_url = Document.url)}(Google \bowtie AltaVista \bowtie Document \bowtie Link). \quad (20)$$
The selection conjunctive condition has to be applied to the entire result of the join $Google \bowtie AltaVistaImages \bowtie Document \bowtie Link$. Only at the end is the Top operator applied, as shown by the corresponding query plan in Fig. 1. $\sigma_{Google \wedge AltaVistaImages \wedge url}$ stands for:
$$\sigma_{(content_{Google} \sim \text{``Raffaello''} \wedge start\_url = \text{``louvre''} \wedge end\_url = Document.url)} \wedge \sigma_{(color_{AltaVistaImages} \sim \text{``white''} \wedge start\_url = \text{``louvre''} \wedge end\_url = Document.url)}.$$
Fig. 1. First query plan.
On the other hand, Expression (20) contains the expression
$$\tau^{5}_{(content_{Google} \sim \text{``Raffaello''}) \wedge (color_{AltaVistaImages} \sim \text{``white''}) \wedge (start\_url = \text{``louvre''} \wedge end\_url = Document.url)}\big(\tau^{5}_{(content_{Google} \sim \text{``Raffaello''}) \wedge (start\_url = \text{``louvre''} \wedge end\_url = Document.url)}(Google \bowtie Document \bowtie Link) \bowtie \tau^{5}_{(color_{AltaVistaImages} \sim \text{``white''}) \wedge (start\_url = \text{``louvre''} \wedge end\_url = Document.url)}(AltaVistaImages \bowtie Document \bowtie Link)\big) \quad (21)$$
thanks to Rule 10. Although Expression (21) looks complex, it reduces the size of the intermediate relations by careful application of the query operators. First, relations Google and AltaVistaImages are separately joined with relation $Document \bowtie Link$, and the selection conjunctive condition is split and evaluated on the corresponding intermediate relation. Then, the Top operator is separately applied over the resulting sets of tuples, yielding the top five tuples satisfying the selection condition according to the Google search engine and the top five tuples satisfying the selection condition according to the AltaVistaImages image search engine. These two sets are then joined and the Top operator is applied again. The query plan corresponding to Expression (21) is depicted in Fig. 1. $\sigma_{Google \wedge url}$ and $\sigma_{AltaVistaImages \wedge url}$ stand respectively for:
$$\sigma_{(content_{Google} \sim \text{``Raffaello''}) \wedge (start\_url = \text{``louvre''} \wedge end\_url = Document.url)}$$
and for
$$\sigma_{(color_{AltaVistaImages} \sim \text{``white''}) \wedge (start\_url = \text{``louvre''} \wedge end\_url = Document.url)}.$$
Let us consider another example concerning an equivalence rule involving the Cut operator and weighted join:
$$\gamma^{0.7}_{(color_{AltaVistaImages} \sim \text{``white''}) \wedge (content_{Google} \sim \text{``Raffaello''}) \wedge (start\_url = \text{``louvre''} \wedge end\_url = Document.url)}((Google^{0.6} \bowtie AltaVista^{0.4}) \bowtie Document \bowtie Link). \quad (22)$$
Also in this case, in order to compute the corresponding answer, the query processor has to compute the (large) result of the join $(Google^{0.6} \bowtie AltaVista^{0.4}) \bowtie Document \bowtie Link$ before the
selection condition and the threshold on Web pages' score can be applied. Thanks to Rule 13, Expression (22) can be safely rewritten as:
$$\gamma^{0.7}_{(color_{AltaVistaImages} \sim \text{``white''}) \wedge (content_{Google} \sim \text{``Raffaello''}) \wedge (start\_url = \text{``louvre''} \wedge end\_url = Document.url)}\big(\gamma^{0.7}_{(content_{Google} \sim \text{``Raffaello''}) \wedge (start\_url = \text{``louvre''} \wedge end\_url = Document.url)}(Google \bowtie Document \bowtie Link)^{0.6} \bowtie \gamma^{0.625}_{(color_{AltaVistaImages} \sim \text{``white''}) \wedge (start\_url = \text{``louvre''} \wedge end\_url = Document.url)}(AltaVista \bowtie Document \bowtie Link)^{0.4}\big). \quad (23)$$
5.2. Query evaluation

In the previous sections we have shown the first steps of fuzzy relational algebra query processing, consisting of query rewriting driven by some of the specified equivalence and containment rules. Now we focus our attention on algorithms for the efficient evaluation of fuzzy relational algebra queries. In particular, we focus on queries involving the Cut and Top operators. We start by considering expressions of the form $\tau^{k}_{A_1 \sim v_1 \wedge \cdots \wedge A_n \sim v_n}(R_1 \bowtie \cdots \bowtie R_u)$. Fagin (1999) has proposed an algorithm, called $A_0$, for the computation of the top $k$ objects satisfying a conjunction of atomic queries, each of which is evaluated over a different subsystem. It is assumed that the query processor can access the tuples of a relation and their attribute scores in two different ways. Under sorted access, the query processor retrieves the tuples ordered according to their score and stops after a fixed number of tuples have been retrieved. Under random access, the query processor is given a specified tuple and a query and outputs the corresponding tuple score.

Why use algorithm $A_0$? If $s$ is the database size, $n$ is the number of different search engines and the queries submitted to them are statistically independent, then it can be proven that the cost of the $A_0$ algorithm (in terms of the number of sorted and random accesses to the database) is on average $O(s^{(n-1)/n} k^{1/n})$ with arbitrarily high probability. Note that this cost is sublinear in $s$. For a detailed study of the complexity of $A_0$, refer to Fagin (1999). In particular, if $n = 2$ (that is, only two search engines are specified), then the cost of algorithm $A_0$ is of the order of the square root of the database size. The worst-case analysis of the $A_0$ algorithm shows that its cost is linear in the database size. Also, note that the naive algorithm, which retrieves all the ordered tuples from the search engines, computes the minimum score for every tuple and then outputs the first $k$ tuples, is linear in the database size. Thus, algorithm $A_0$ efficiently computes the answer to the fuzzy relational algebra query:
\[
\tau^{k}_{SE_1 \sim_{SE_1} v_1 \wedge \cdots \wedge SE_n \sim_{SE_n} v_n}(R). \tag{25}
\]
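To make the two access modes concrete, here is a minimal Python sketch of the three-step structure of $A_0$ as outlined above. It rests on simplifying assumptions that are not part of the paper: each subsystem is modelled as an in-memory dictionary from object identifier to score, sorted access is a walk down a pre-sorted list, random access is a dictionary lookup, and the conjunction is scored with the minimum. The names `fagin_a0` and `subsystems` are illustrative only.

```python
from typing import Dict, List, Set, Tuple


def fagin_a0(subsystems: List[Dict[str, float]], k: int) -> List[Tuple[str, float]]:
    """Return up to k objects with the highest min-score over all subsystems."""
    n = len(subsystems)
    # Sorted access view: every subsystem lists (object, score) by decreasing score.
    ranked = [sorted(s.items(), key=lambda p: (-p[1], p[0])) for s in subsystems]
    seen_in: Dict[str, Set[int]] = {}  # object -> indexes of lists where it was seen
    matches = 0                        # objects seen under sorted access in all n lists
    depth = 0
    max_depth = max((len(r) for r in ranked), default=0)

    # Step 1: sorted access in parallel until k objects were seen in every list.
    while matches < k and depth < max_depth:
        for i, r in enumerate(ranked):
            if depth < len(r):
                obj = r[depth][0]
                lists = seen_in.setdefault(obj, set())
                lists.add(i)
                if len(lists) == n:
                    matches += 1
        depth += 1

    # Step 2: random access fills in the missing scores of every seen object
    # (assumption: an object unknown to a subsystem scores 0.0 there).
    # Step 3: combine with min and keep the k best.
    scored = [(obj, min(s.get(obj, 0.0) for s in subsystems)) for obj in seen_in]
    scored.sort(key=lambda p: (-p[1], p[0]))
    return scored[:k]
```

In this sketch, objects that were never seen under sorted access are never scored, which is where the saving over the naive full scan comes from; a real search engine would of course not expose its score table as a dictionary.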
Of course, this is just one of the many queries expressible using fuzzy relational algebra. Apart from Query 25, we consider the following one, containing a Cut operator:
\[
\gamma_{\alpha,\ SE_1 \sim_{SE_1} v_1 \wedge \cdots \wedge SE_n \sim_{SE_n} v_n}(R). \tag{26}
\]
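As a point of reference for Queries 25 and 26, the naive evaluation strategy mentioned earlier (fetch every score, combine with the minimum, then rank or filter) can be sketched in Python as follows. This is only an illustration under stated assumptions: each search engine is a dictionary from url to score for a fixed query, a url unknown to an engine is given score 0.0, and the names `combined_scores`, `naive_top` and `naive_cut` are hypothetical.

```python
from typing import Dict, List, Tuple

Scores = Dict[str, float]  # url -> score assigned by one search engine


def combined_scores(engines: List[Scores]) -> Dict[str, float]:
    """Score of the conjunction for every url: the minimum over all engines."""
    urls = set().union(*engines)
    # Assumption: a url that an engine does not know scores 0.0 for that engine.
    return {u: min(e.get(u, 0.0) for e in engines) for u in urls}


def ranked(engines: List[Scores]) -> List[Tuple[str, float]]:
    """All urls by decreasing combined score (ties broken by url)."""
    return sorted(combined_scores(engines).items(), key=lambda p: (-p[1], p[0]))


def naive_top(engines: List[Scores], k: int) -> List[Tuple[str, float]]:
    """Query 25, naively: the k urls with the highest combined score."""
    return ranked(engines)[:k]


def naive_cut(engines: List[Scores], alpha: float) -> List[Tuple[str, float]]:
    """Query 26, naively: every url whose combined score is at least alpha."""
    return [(u, s) for u, s in ranked(engines) if s >= alpha]
```

Both functions touch every tuple of every engine, so their cost is linear in the database size; they are useful mainly as a correctness baseline for the more efficient algorithms.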
In the following, we present two algorithms, called Top and Cut respectively, for the efficient evaluation of Queries 25 and 26. The Top algorithm is an implementation of Fagin's $A_0$ designed under the constraint that sorted access is the only way to retrieve data from search engines. Random access is simulated using sorted access in this simple way: we retrieve pages in sorted access fashion until the desired page pops out, along with the corresponding attribute degree. The Cut algorithm efficiently evaluates Query 26, taking into account the same search engine access limitations described for Query 25.

Fig. 2. Second query plan.

We briefly review the Top algorithm informally, giving the pseudocode in Fig. 4 and commenting on it afterwards. We assume the availability of a procedure $AskSE(SE_i, Q, j)$ that submits the (keyword-based) query $Q$ to the search engine $SE_i$ and retrieves the first $j$ Web pages according to their relevance to $Q$. The Top algorithm consists of three phases:
• Sorted access phase: For each search engine $SE_1, \ldots, SE_n$, use the procedure AskSE to retrieve the urls of the Web pages having the greatest scores for query $Q$. Store the retrieved urls in commonset. The sorted access phase continues until commonset contains $k$ elements.
• Random access phase: Every search engine provides the attribute score of every Web page retrieved in the previous phase. Thus, for every Web page whose url is contained in commonset, use AskSE to find its corresponding score for every search engine $SE_1, \ldots, SE_n$.
• Computation phase: For every url retrieved, compute the minimum of the different scores retrieved from the search engines $SE_1, \ldots, SE_n$. The output is the set of $k$ urls having the highest scores.

A similar algorithm is employed in the evaluation of the query $\gamma_\alpha(\sigma_{A_1 \sim v_1 \wedge \cdots \wedge A_n \sim v_n}(R))$. The pseudocode is shown in Fig. 5. In this case, the Cut algorithm interleaves sorted accesses with random accesses in order to check that the retrieved tuple scores satisfy the threshold specified by the Cut operator. Note that $AskSE(i, Q, j_1, j_2)$ stands for $AskSE(i, Q, j_2) \setminus AskSE(i, Q, j_1)$, where $j_1 \leq j_2$ and $\setminus$ denotes set difference. More precisely, the phases of the algorithm are:
Fig. 3. The algorithm for the evaluation of $\gamma_\alpha(\tau^{k}_{c}(R))$.
Fig. 4. Top algorithm.
• Sorted and random access step: For each search engine relation $SE_1, \ldots, SE_n$, retrieve the tuples with the highest scores that have not yet been retrieved. For every tuple retrieved in this way, ask every search engine for the corresponding tuple score.
• Threshold control step: For every $1 \leq i \leq n$, check whether the tuple scores satisfy the threshold condition. Return to the sorted and random access step and retrieve only from those relations satisfying the threshold condition $t.\mu \geq \alpha$.
• Computation phase: For each tuple retrieved, compute the score $\min(t_j.\mu_{R_1}, \ldots, t_j.\mu_{R_n})$. The output is the set of tuples whose scores are greater than or equal to $\alpha$.

Next, we turn our attention to the problem of efficiently evaluating expressions of the form $\gamma_\alpha(\tau^{k}_{c}(E))$. A straightforward execution strategy would execute the Cut operator only after the Top operator has completed. The drawback of this strategy is that the standalone execution of the Top operator may yield a large intermediate result, part of which may then be discarded by the Cut operator. Here we propose a simple yet effective algorithm that does not require the computation of the entire intermediate result of the Top operator. The algorithm is based on the simple idea that, whenever a Web page is included in the result of the Top operator, its score is checked against the threshold specified by the Cut operator: if the Web page score is greater than or equal to
Fig. 5. The Cut algorithm.
the threshold, then the next Web page in the result of the Top operator is checked. Otherwise, the entire execution is stopped (Figs. 3–5).
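The ideas of this section can be sketched in Python as follows. This is a hedged illustration, not the pseudocode of Figs. 3 to 5: a search engine is modelled as a dictionary from url to score for a fixed query, `ask_se` is a stand-in for the AskSE procedure (the query argument is dropped for brevity), the sorted access phase stops once `k` urls have been returned by every engine, and every url seen so far is scored. All function names are hypothetical. Note also that the ranking is still computed eagerly here; only the scan of the ranked output stops early.

```python
from itertools import takewhile
from typing import Dict, List, Tuple

Engine = Dict[str, float]  # url -> score for a fixed query (in-memory stand-in)


def ask_se(engine: Engine, j: int) -> List[Tuple[str, float]]:
    """Sorted access only: the first j (url, score) pairs by decreasing score."""
    return sorted(engine.items(), key=lambda p: (-p[1], p[0]))[:j]


def simulated_random_access(engine: Engine, url: str) -> float:
    """Find the score of url by deepening sorted access until the url shows up."""
    j = 1
    while True:
        page = ask_se(engine, j)
        if len(page) < j:          # engine exhausted: url unknown (assumed score 0.0)
            return 0.0
        if page[-1][0] == url:
            return page[-1][1]
        j += 1


def top_sorted_only(engines: List[Engine], k: int) -> List[Tuple[str, float]]:
    """Top-k urls by min-score, using nothing but sorted access."""
    limit = max((len(e) for e in engines), default=0)
    seen = [set() for _ in engines]
    common: set = set()
    j = 0
    # Sorted access phase: deepen until k urls were returned by every engine.
    while len(common) < k and j < limit:
        j += 1
        for urls, engine in zip(seen, engines):
            urls.update(u for u, _ in ask_se(engine, j))
        common = set.intersection(*seen)
    # Random access phase (simulated) and computation phase.
    candidates = set().union(*seen)
    scored = [(u, min(simulated_random_access(e, u) for e in engines))
              for u in candidates]
    scored.sort(key=lambda p: (-p[1], p[0]))
    return scored[:k]


def cut_over_top(engines: List[Engine], k: int, alpha: float) -> List[Tuple[str, float]]:
    """Scan the Top result in rank order and stop at the first score below alpha."""
    return list(takewhile(lambda p: p[1] >= alpha, top_sorted_only(engines, k)))
```

Because the Top result is ordered by decreasing score, the first page that fails the threshold guarantees that all later pages fail it too, which is why the scan in `cut_over_top` can stop immediately.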
6. Conclusions

In this paper we have presented a fuzzy logic-based extension of the classical relational model, suitable for representing the imprecision inherent in various kinds of data, such as those present on the Web. We have defined a fuzzy relational algebra that allows the formulation of queries about the similarity of objects represented in the fuzzy relational model. We have developed a presentation algebra suitable for ordering the answers and discarding their less relevant items. We have studied equivalence and containment rules suitable for query optimization. Finally, we have introduced new kinds of equivalence notions among fuzzy relational expressions. Future work will focus on the development of more complex data models based on, for example, nested relational models or semistructured data models, in order to adequately represent Web documents and sites. Further, the study of more appropriate notions of similarity deserves much attention, as does the development of optimization techniques for the corresponding algebras.
References

Adali, S., Bonatti, P., Sapino, M. L., & Subrahmanian, V. S. (1998). A multi-similarity algebra. In Proceedings of the 1998 ACM-SIGMOD international conference on management of data.
Agrawal, R., Faloutsos, C., & Swami, A. (1993). Efficient similarity search in sequence databases. In Proceedings of the 4th international conference on foundations of data organizations and algorithms (FODO'93) (pp. 69–84).
Buckles, B. P., & Petry, F. E. (1982). A fuzzy representation of data for relational databases. Fuzzy Sets and Systems, 7, 213–226.
Buneman, P. (1997). Semistructured data. In PODS '97: Proceedings of the sixteenth ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems (pp. 117–121). May 12–14, 1997, Tucson, Arizona. ACM Press.
Ciaccia, P. (1998). An algebra for similarity queries and its index-based evaluation. ESPRIT project 9141 HERMES Technical Report.
Ciaccia, P., Montesi, D., Penzo, W., & Trombetta, A. (2000). Imprecision and user preferences in multimedia queries: a generic algebraic approach. In Proceedings of conference on foundations on information and knowledge systems.
Ciaccia, P., Patella, M., & Zezula, P. (1997). M-tree: an efficient access method for similarity search in metric spaces. In Proceedings of the 23rd VLDB international conference (pp. 426–435).
Crestani, P. (1998). Is this document relevant? . . . probably. ACM Computing Surveys, 30(4), 528–552.
Fagin, R. (1998). Fuzzy queries in multimedia database systems. In Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems, June 1–3, 1998, Seattle, Washington.
Fagin, R. (1999). Combining fuzzy information from multiple systems. Journal of Computer and System Sciences, 58, 83–99.
Fagin, R., & Wimmers, E. (1997). Incorporating user preferences in multimedia queries. In Proceedings of the seventh international conference on database theory.
Faloutsos, C. (1996). Searching Multimedia Databases by Content. Dordrecht: Kluwer Academic Press.
Florescu, D., Levy, A., & Mendelzon, A. (1998). Database techniques for the World-Wide Web: A survey. SIGMOD Record (ACM Special Interest Group on Management of Data), 27(3).
Goebel, V., et al. (1999). Design, implementation, and evaluation of toomm: A temporal object-oriented multimedia data model. In WG 2.6 working conference on database semantics: semantics issues in multimedia (DS-8), Rotorua, New Zealand.
Jagadish, H. V., Mendelzon, A., & Milo, T. (1995). Similarity-based queries. In Proceedings of the fourteenth ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems.
Ramakrishnan, R., & Gehrke, J. (2000). Database management systems. New Jersey: Prentice Hall.
Trombetta, A. (2001). Representing and Querying Imprecise Data. Ph.D. Thesis, Computer Science Dept., University of Turin.
Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8, 338–353.
Zemankova-Leech, M., & Kandel, A. (1984). Fuzzy relational databases: A key to expert systems. Verlag TUV.