Data & Knowledge Engineering 35 (2000) 259-298
www.elsevier.com/locate/datak
Similarity-based ranking and query processing in multimedia databases

K. Selçuk Candan a, Wen-Syan Li b,*, M. Lakshmi Priya a

a Computer Science and Engineering Department, Arizona State University, Box 875406, Tempe, AZ 85287-5406, USA
b C&C Research Laboratories, NEC USA, Inc., MS/SJ10, San Jose, CA 95134, USA

Received 20 April 1999; received in revised form 21 December 1999; accepted 22 May 2000
Abstract

Since media-based evaluation yields similarity values, the result of a multimedia database query, Q(Y1, ..., Yn), is defined as an ordered list SQ of n-tuples of the form <X1, ..., Xn>. The query Q itself is composed of a set of fuzzy and crisp predicates, constants, variables, and conjunction, disjunction, and negation operators. Since many multimedia applications require partial matches, SQ includes results which do not satisfy all predicates. Due to the ranking and partial match requirements, traditional query processing techniques do not apply to multimedia databases. In this paper, we first focus on the problem of "given a multimedia query which consists of multiple fuzzy and crisp predicates, providing the user with a meaningful final ranking". More specifically, we study the problem of merging similarity values in queries with multiple fuzzy predicates. We describe the essential multimedia retrieval semantics, compare these with the known approaches, and propose a semantics which captures the requirements of the multimedia retrieval problem. We then build on these results in answering the related problem of "given a multimedia query which consists of multiple fuzzy and crisp predicates, finding an efficient way to process the query". We develop an algorithm to efficiently process queries with unordered fuzzy predicates (sub-queries). Although this algorithm can work with different fuzzy semantics, it benefits from the statistical properties of the semantics proposed in this paper. We also present experimental results for evaluating the proposed algorithm in terms of quality of results and search space reduction. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: Approximate search; Fuzzy query processing; Fagin's algorithm; Probe function
1. Introduction

Multimedia data includes image and video data, which are very complex in terms of their visual and semantic contents. Depending on the application, multimedia objects are modeled and indexed using (1) their visual properties (or a set of relevant visual features), (2) their semantic properties, and/or (3) the spatial/temporal relationships of their subobjects.
* Corresponding author. Tel.: +1-408-943-3008; fax: +1-408-943-3099.
E-mail addresses: [email protected] (K. Selçuk Candan), [email protected] (W.-S. Li), [email protected] (M. Lakshmi Priya).
Fig. 1. Fuzzy media modeling example.
Example 1.1. Fig. 1 gives an example where multimedia data is modeled using both visual and semantic features. Fig. 1(a) shows an image, Boy_bike.gif, whose structure is viewed as a hierarchy with two image components (i.e., a boy and a bicycle). These components are identified based on color/shape region division (visual interpretation) and their real-world meanings (semantic interpretation). The relationship between an image and its components is contains. The spatial relationships can be described in a 2D string representation [1]. The objects, Obj1 and Obj2, have both visual and semantic properties that can be used in retrieval. Fig. 1 also shows a video frame whose structure can be viewed as a hierarchy with two image components identified based on color/shape region division (visual interpretation) and their real-world meanings (semantic interpretation). The spatial relationships of these components can be described using various techniques [1,2]. These components, or objects, have both visual and semantic properties that can be used in retrieval. In this example, objects have multiple candidate semantics, each with an associated confidence value (smaller than 1.0 due to the limitations of the recognition engine). Therefore, because of the following reasons, retrieval in multimedia databases is inherently fuzzy:
· similarity of media features, such as the correlation between color (red vs orange) or shape (circle vs ellipse) features;
· imperfections in the feature extraction algorithms, such as the high error rate in motion estimation due to the multitude of factors involved, including camera and object speed, and camera effects;
· imperfections in the query formulation methods, such as the query by example (QBE) method, where the user provides an example but is not aware of which features will be used for retrieval;
· partial match requirements, where objects in the database fail to satisfy all requirements in the query;
· imperfections in the available index structures, such as low precision or recall rates due to imperfections in the clustering algorithms.
In many multimedia applications, more than one of these reasons coexist and, consequently, the system must take each of them into consideration. This requires quantification of the different sources of fuzziness and their merging into a single combined value for the user's reference. The following example describes this requirement in greater detail.
Example 1.2. A query for retrieving images containing Fuji Mountain and a lake can be specified with an SQL3-like query statement [3-5] as follows:

    select image P, object object1, object object2
    where P contains object1
    and P contains object2
    and object1.semantical_property s_like "mountain"
    and object1.image_property image_match "Fuji_mountain.gif"
    and object2.semantical_property is "lake"
    and object2.image_property image_match "lake_image_sample.gif"
    and object1.position is_above object2.position

The above query contains two crisp query predicates: contains and is. It also contains a set of fuzzy query predicates:
· s_like (i.e., semantically similar) evaluates the degree of semantic similarity between two terms. It helps resolve correlations between semantic features, as well as imperfections in the semantics extraction algorithms, in the index structures, and in the user queries.
· image_match (i.e., visually like) evaluates the visual similarity between two images. It helps resolve correlations between visual features and imperfections in the index structures.
· is_above (a spatial condition) compares the spatial positions of two objects. It helps resolve correlations between spatial features, imperfections in the spatial information extraction algorithms, imperfections in the index structures, and imperfections in the user queries.
This query returns a set of 3-tuples of the form <P, object1, object2> that satisfy all crisp conditions and that have a combined fuzzy score above a given threshold. If users desire, the results may be sorted based on their overall scores for quicker access to relevant results. Fig. 2(Query) shows the conceptual representation of the above query. Fig. 2(a), (b), (c), and (d) show examples of candidate images that may match this query. The numbers next to the objects in these candidate images denote the similarity values for the object-level matching. As explained earlier, in this example, the comparisons on spatial relationships are also fuzzy, to account for correlations between spatial features. The candidate image in Fig. 2(a) satisfies the object matching conditions, but its layout does not match the user specification. Fig. 2(b) and (d) satisfy the image layout condition, but their objects do not perfectly match the specification. Fig. 2(c) has structural and object matches with low scores. Note that in Fig. 2(a) the spatial predicate, and in Fig. 2(c) the image similarity predicate for the lake, completely fail (i.e., the match is 0.0).
Fig. 2. Partial matches.
A multimedia database engine must consider all four images as candidates and must rank them according to a certain unified criterion. In this paper, we first address the problem of "given a query which consists of multiple fuzzy and crisp predicates, how to provide a meaningful final ranking to the users." We propose an alternative scoring approach which captures the multimedia semantics well and which does not face the above problem while handling partial matches. Although it is not based on weighing, the proposed approach can be used along with weighing strategies, if weighing is requested by the user. We then focus on the problem of "given a query which consists of multiple fuzzy and crisp predicates and a scoring function, how to efficiently process the query." It is clear that current database engines are not designed to answer the needs of these kinds of queries.

Recently, there have been attempts to address the challenges associated with processing queries of the above kind. Adalı et al. [9] propose an algebra for similarity-based queries. Fagin [10,11] proposes a set of efficient query execution algorithms for databases with fuzzy (similarity-based) queries. The algorithms proposed by Fagin assume that (1) individual sources can progressively (in decreasing order of score) output results, and (2) users are interested in the best k matches to the query. This, for instance, would require both the s_like and image_match predicates used in the above example to return ordered results. The first assumption, however, may be invalid due to the limited binding and processing capabilities of the sources; for instance, it may be invalidated by the binding rules imposed by the predicates.

Example 1.3 (Motivating example). Consider the following SQL-like query, which asks for 10 pairs of visually similar images, such that both images in a given pair contain at least one object, a "mountain" and a "tree", respectively:

    select 10 P1, P2
    where semantically_like(P1.semantical_property, "mountain")
    and semantically_like(P2.semantical_property, "tree")
    and image_match(P1.image_property, P2.image_property)

This query contains three fuzzy conditions: two semantically_like predicates and one image_match predicate. Let us assume that the image_match predicate is implemented as an external function, which can be invoked only by providing two input images (i.e., both arguments have to be bound). The image_match predicate then returns a score denoting the visual similarity of its inputs. Hence, the predicate is fuzzy, but it cannot generate results in the order of score; i.e., it is a non-progressive fuzzy predicate. Let us also assume that the semantically_like predicate is implemented as an external function which, when invoked with its second argument bound, can return matching images (using an index structure) in decreasing order of score. Consequently, in this example, we have two sources (the semantically_like predicates) which can output images progressively through database access and one source (image_match) which cannot, and results from all these sources have to be merged to get the final set of results.
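The distinction drawn in Example 1.3 between progressive and non-progressive predicates can be phrased as two different access interfaces. The sketch below is only an illustration of that distinction; the predicate implementations and the toy index contents are hypothetical, not those of any system cited in this paper.

    # A minimal sketch of the two access interfaces discussed above.
    # A progressive predicate can stream matches in decreasing order of score
    # (e.g., via an index); a non-progressive one can only be probed with all
    # of its arguments bound. Both implementations below are hypothetical.

    def semantically_like(term):
        """Progressive: yields (image_id, score) pairs in decreasing score order."""
        index = [("img7", 0.95), ("img2", 0.90), ("img5", 0.40)]  # toy index
        for image_id, score in index:
            yield image_id, score

    def image_match(image_a, image_b):
        """Non-progressive: both images must be bound; returns a single score."""
        return 0.8 if (image_a, image_b) == ("img7", "img2") else 0.1

    # The progressive source can be consumed lazily ...
    best_mountain = next(semantically_like("mountain"))
    # ... while the non-progressive one must be probed pair by pair.
    print(best_mountain, image_match("img7", "img2"))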
Unfortunately, due to the existence of a non-progressive predicate, finding and returning the 10 best matching pairs of pictures would require a complete scan of the database, which is clearly undesirable.

In this paper, we propose a query processing approach which uses score distribution function estimates/statistics [4] and the statistical properties of the score merging functions for computing approximate top-k results when some of the predicates cannot return ordered results. More specifically, we propose a query processing algorithm which, given a query Q, uses the available score distribution function estimates/statistics [4] of the individual predicates and the statistical properties of the corresponding score merging function (μ_Q) for computing approximate top-k results when some of the predicates in Q cannot return ordered results. That is, we provide a solution to the following problem: "Given a query Q, a positive number k, and an error threshold Θ, return a set R of k results, such that each result r ∈ R is most probably (prob(r ∈ R_k) > 1 − Θ) in the top-k results, R_k."

The paper is structured as follows. In Section 2, we provide an overview of the multimedia retrieval semantics. We show the similarities between multimedia queries and fuzzy logic statements. Then, in Section 3, we provide an overview of the popular fuzzy logic semantics and compare them with respect to the essential requirements of the multimedia retrieval problem. In Section 4, we propose an algorithm for generating approximate results for queries with non-progressive fuzzy predicates. In Section 5, we investigate the statistical properties of popular fuzzy logic semantics and we describe a generic function that approximates the score distributions used in the algorithm, when such distributions are not readily available. In Section 6, we experimentally evaluate the proposed algorithm. In Section 7, we compare our approach with existing work. Finally, we present our concluding remarks.

2. Multimedia retrieval semantics

In this section, we first provide an overview of the multimedia retrieval semantics; i.e., we describe what we mean by retrieval of multimedia data. We then review some of the approaches to deal with multimedia queries that involve multiple fuzzy predicates.

2.1. Fuzziness in multimedia retrieval

It is possible to classify the fuzziness in multimedia queries into three categories: precision-related, recall-related, and partiality-related fuzziness. The precision-related class captures fuzziness due to similarity of features, imperfections in the feature extraction algorithms, imperfections in the query formulation methods, and the precision rate of the utilized index structures (Fig. 3). The recall- and partiality-related classes are self-explanatory. Note that, in information retrieval, precision/recall values are mainly used in evaluating the effectiveness of a given retrieval operation or of a given index structure. Here, we are using these terms more as statistics which can be utilized to estimate the quality of query results. We have used this approach in the SEMCOG image retrieval system [3-5] to provide pre- and post-query feedback to users. The two examples given below show why precision and recall are important in processing multimedia queries.
Fig. 3. Clustering error which results in imperfections in the index structures: the squares denote the matching objects, circles denote the non-matching objects, and the dashed rectangle denotes the cluster used by the index for efficient storage and retrieval.
Example 2.1 (Handling precision-related fuzziness). Let us assume that we are given a query of the form

  Q(X) = s_like("man", X.semantic_property) ∧ image_match(X.image_property, "a.gif").

Let, for a given object I, the corresponding semantic property be "woman" and the image property be "im.gif". Let us assume that the semantic precision of s_like("man", "woman") is 0.8 and the image matching precision of image_match("im.gif", "a.gif") is 0.6. This means that the index structure and semantic clusters used to implement the predicate s_like guarantee that 80% of the returned results are semantically similar to "man". These semantic similarities can be evaluated using various algorithms [13,14]. Similarly, the predicate image_match guarantees that 60% of the returned results are visually similar to "a.gif". Then, assuming that the two predicates are not correlated, Q(I) should be 0.8 × 0.6 = 0.48: by replacing X in s_like("man", X) with "woman", we maintain 80% precision; then, by replacing the X in image_match(X, "a.gif") with I, we maintain 60% of the remaining precision. The final precision (or confidence) is 0.48.

Example 2.2 (Handling recall-related fuzziness). Recall rate is the ratio of the number of returned results to the number of all relevant results. Let us assume that we have the query used in the earlier example. Let us also assume that s_like("man", X) returns 60% and image_match(X, "a.gif") returns 50% of all applicable results in the database. Let us further assume that both functions work perfectly when both variables are bound. Then (1) a left-to-right query execution plan would return 0.6 × 1.0 = 0.6, (2) a right-to-left query execution plan would return 1.0 × 0.5 = 0.5, and (3) a parallel execution of the predicates followed by a join would return 0.6 × 0.5 = 0.3 of all the applicable results. Consequently, given a query execution plan and assuming that the predicates are independent, the recall rate can be found by multiplying the recall rates of the predicates.
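To make the bookkeeping in Examples 2.1 and 2.2 concrete, the combined precision and recall of an execution plan can be computed by multiplying the per-predicate statistics under the independence assumption used above. The helpers below are only an illustrative sketch, not part of the system described in this paper; the numbers follow the two examples.

    # A minimal sketch: combining per-predicate precision/recall statistics
    # under the independence assumption of Examples 2.1 and 2.2.

    def combined_precision(precisions):
        """Multiply the precision of every predicate evaluated in the plan."""
        result = 1.0
        for p in precisions:
            result *= p
        return result

    def combined_recall(recalls_of_searching_predicates):
        """Multiply the recall of every predicate evaluated with an unbound
        (searched) argument; fully bound probes are assumed to be perfect."""
        result = 1.0
        for r in recalls_of_searching_predicates:
            result *= r
        return result

    # Example 2.1: s_like precision 0.8, image_match precision 0.6
    print(combined_precision([0.8, 0.6]))   # 0.48

    # Example 2.2: left-to-right plan (only s_like searches the database)
    print(combined_recall([0.6]))           # 0.6
    # parallel plan followed by a join (both predicates search)
    print(combined_recall([0.6, 0.5]))      # 0.3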
2.2. Query semantics in the presence of imperfections and similarities

Traditional query languages are based on Boolean logic, where each predicate is treated as a propositional function which returns one of two values: true or false. However, due to the stated imperfections, predicates related to visual or semantic features do not correspond to propositional functions, but to functions which return values between 0.0 and 1.0. Consequently, a query has the form

  Q(Y1, ..., Yn) = H(p1(Y1, ..., Yn), ..., pm(Y1, ..., Yn)),

where the pi are fuzzy or crisp predicates, H is a logic formula, and the Yj are free variables. The solution is defined as an ordered list SQ of n-tuples of the form <X1, X2, ..., Xn>, where (1) n is the number of variables in query Q, (2) each Xi corresponds to a variable Yi in Q, and (3) each Xi satisfies the type constraints of the corresponding predicates in Q. The order of the list SQ denotes the relevance ranking of the solutions.

The first, trivial, way to process multimedia queries is to transform the similarity functions into propositional functions by choosing a cut-off point, r_true, and by mapping all numbers in [0.0, r_true) to false and numbers in [r_true, 1.0] to true. Intuitively, such a cut-off point corresponds to a similarity degree which denotes dissimilarity. The main advantage of this approach is that predicates can refute (conjunctive queries) or validate (disjunctive queries) solutions as soon as they are evaluated, so that optimizations can be employed. Chaudhuri and Gravano [15] discuss cost-based query optimization techniques for such filter queries.¹ If users only consider full matching, this method is preferred because it allows query optimization by early pruning of the search space. However, when partial matches are also acceptable, this approach fails to produce appropriate solutions. For instance, in Example 1.2, the candidate images shown in Fig. 2(a) and (c) would not be considered.

The second way to process multimedia queries is to leave the final decision not to the constituent predicates but to the n-tuple as a whole. This can be done by defining a scoring function μQ which maps a given object to a value between 0.0 and 1.0. In this method a candidate object, o, is returned as a solution if μQ(o) ≥ sol_true, where sol_true is the solution acceptance threshold. Since determining an appropriate threshold is not always possible, an alternative approach is to rank all candidate objects according to their scores and return the first k candidates, where k is a user-defined parameter. Related problems are studied within the domains of fuzzy set theory [16], fuzzy relational databases [17], and probabilistic databases [18]. If users are also interested in partial matches, the second method is more suitable. The techniques proposed in [15] do not address queries with partial matches since, for a given object, if any filter condition fails, the object is omitted from the set of results. Furthermore, the proposed algorithms are tailored towards the min semantics which, as discussed later, despite its many proven advantages, is not suitable for multimedia applications. In this paper, we focus on the second approach.

As discussed above, multimedia predicates, by their nature, have associated scoring functions. Fuzzy sets and the corresponding fuzzy predicates also have similar scoring or membership functions. Consequently, we will examine the use of fuzzy logic for multimedia retrieval. The major challenge with the use of the second approach is that early filtering cannot always be applied. As mentioned earlier, Fagin [10,11] has proposed algorithms for efficient query processing when all fuzzy predicates are ordered. In contrast, our aim is to develop an algorithm that uses statistics to efficiently process such queries even when not all of the fuzzy predicates are ordered.

¹ Their techniques assume that queries do not contain negation.
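As an illustration of the second approach, the following sketch ranks candidate tuples by a combined scoring function and returns either the tuples above an acceptance threshold or the best k of them. The scoring function passed in is a placeholder assumption, not a specific semantics from this paper.

    # A minimal sketch of the second approach: score every candidate tuple with
    # a combined scoring function mu_Q and keep either the tuples above a
    # solution acceptance threshold or the top-k tuples. Both mu_Q and the
    # candidate list are assumed to be supplied by the caller.

    def answer_by_threshold(candidates, mu_Q, sol_true):
        return sorted((c for c in candidates if mu_Q(c) >= sol_true),
                      key=mu_Q, reverse=True)

    def answer_by_top_k(candidates, mu_Q, k):
        return sorted(candidates, key=mu_Q, reverse=True)[:k]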
3. Application of fuzzy logic for multimedia databases

In this section, we provide an overview of fuzzy logic and then introduce the properties of different fuzzy logic operators. A fuzzy set, F, with domain D can be defined using a membership function μF : D → [0, 1]. A crisp (or conventional) set, C, on the other hand, has a membership function of the form μC : D → {0, 1}. When, for an element d ∈ D, μC(d) = 1, we say that d is in C (d ∈ C); otherwise we say that d is not in C (d ∉ C). Note that a crisp set is a special case of a fuzzy set. A fuzzy predicate is defined as a predicate which corresponds to a fuzzy set. Instead of returning true (1) or false (0) values, as propositional functions (or conventional predicates, which correspond to crisp sets) do, fuzzy predicates return the corresponding membership values.

Binary logical operators (∧, ∨) take two truth values and return a new truth value. The unary logical operator, ¬, on the other hand, takes one truth value and returns another truth value. Similarly, binary fuzzy logical operators take two values between 0.0 and 1.0 and return a third value between 0.0 and 1.0. A unary fuzzy logical operator takes one value between 0.0 and 1.0 and returns another value between 0.0 and 1.0.

3.1. Relevant fuzzy logic operator semantics

There is a multitude of functions [19-21], each useful in a different application domain, proposed as semantics for the fuzzy logic operators (∧, ∨, ¬). In this section, we introduce the popular scoring functions, discuss their properties, and show why these semantics may not be suitable for multimedia retrieval. Two of the most popular scoring functions are the min and product semantics of fuzzy logical operators: given a set P = {P1, ..., Pm} of fuzzy sets and F = {μ1(x), ..., μm(x)} of corresponding membership functions, Table 1 shows the min and product semantics.

Table 1
Min and product semantics for fuzzy logical operators (a ∈ [0, 1])

  Min semantics:
    μ_{Pi ∧ Pj}(x) = min{μi(x), μj(x)}
    μ_{Pi ∨ Pj}(x) = max{μi(x), μj(x)}
    μ_{¬Pi}(x) = 1 − μi(x)

  Product semantics:
    μ_{Pi ∧ Pj}(x) = μi(x) μj(x) / max{μi(x), μj(x), a}
    μ_{Pi ∨ Pj}(x) = (μi(x) + μj(x) − μi(x) μj(x) − min{μi(x), μj(x), 1 − a}) / max{1 − μi(x), 1 − μj(x), a}
    μ_{¬Pi}(x) = 1 − μi(x)

These two semantics (along with some others) have the following property: defined as such, the binary conjunction and disjunction operators are triangular-norms (t-norms) and triangular-conorms (t-conorms). Table 2 shows the properties of t-norm and t-conorm functions. Intuitively, t-norm functions reflect the properties of the crisp conjunction operation and t-conorm functions reflect those of the crisp disjunction operation. Although the property of capturing crisp semantics is desirable in many cases, for multimedia applications this is not always true. For instance, the partial match requirements invalidate the boundary conditions. In addition, monotonicity is too weak a condition for multimedia applications.
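The two binary semantics of Table 1 can be written down directly; the sketch below is only an illustration of the formulas (with the parameter a exposed as an argument), not code from any of the systems cited in this paper.

    # A minimal sketch of the min and product semantics of Table 1.
    # mu_i and mu_j are the membership scores of the two conjuncts for a tuple;
    # `a` is the parameter of the product semantics, a value in [0, 1].

    def min_and(mu_i, mu_j):
        return min(mu_i, mu_j)

    def min_or(mu_i, mu_j):
        return max(mu_i, mu_j)

    def product_and(mu_i, mu_j, a=1.0):
        return (mu_i * mu_j) / max(mu_i, mu_j, a)

    def product_or(mu_i, mu_j, a=1.0):
        return (mu_i + mu_j - mu_i * mu_j - min(mu_i, mu_j, 1.0 - a)) / \
               max(1.0 - mu_i, 1.0 - mu_j, a)

    def fuzzy_not(mu_i):
        return 1.0 - mu_i

    # With a = 1 the conjunction reduces to the plain product mu_i * mu_j,
    # which is strictly increasing in each argument, unlike min.
    print(min_and(0.5, 0.9), product_and(0.5, 0.9))   # 0.5 0.45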
Table 2
Properties of triangular-norm and triangular-conorm functions

  T-norm binary function N (for ∧):
    Boundary conditions: N(0, 0) = 0; N(x, 1) = N(1, x) = x
    Commutativity: N(x, y) = N(y, x)
    Monotonicity: x ≤ x′, y ≤ y′ → N(x, y) ≤ N(x′, y′)
    Associativity: N(x, N(y, z)) = N(N(x, y), z)

  T-conorm binary function C (for ∨):
    Boundary conditions: C(1, 1) = 1; C(x, 0) = C(0, x) = x
    Commutativity: C(x, y) = C(y, x)
    Monotonicity: x ≤ x′, y ≤ y′ → C(x, y) ≤ C(x′, y′)
    Associativity: C(x, C(y, z)) = C(C(x, y), z)
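A quick numerical check of the Table 2 properties for two conjunction functions, min and the plain product (the a = 1 case of the product semantics), can be written as follows; the sampling grid and helper names are assumptions of this sketch only. It also illustrates the distinction, discussed next, between monotonicity and the strictly increasing property.

    # A minimal sketch that samples the unit square and checks the t-norm
    # properties of Table 2 for min and the plain product, and illustrates
    # that min is monotone but not strictly increasing.

    def is_t_norm(N, step=0.1):
        grid = [round(i * step, 10) for i in range(int(1 / step) + 1)]
        ok = N(0, 0) == 0 and all(abs(N(x, 1) - x) < 1e-9 for x in grid)      # boundary
        ok &= all(abs(N(x, y) - N(y, x)) < 1e-9 for x in grid for y in grid)  # commutativity
        ok &= all(N(x, y) <= N(x2, y2) + 1e-9
                  for x in grid for y in grid
                  for x2 in grid for y2 in grid if x <= x2 and y <= y2)       # monotonicity
        return ok

    prod = lambda x, y: x * y
    print(is_t_norm(min), is_t_norm(prod))     # both pass the t-norm checks
    print(min(0.3, 0.9) == min(0.3, 0.95))     # True: min is not strictly increasing
    print(prod(0.3, 0.9) < prod(0.3, 0.95))    # True: the product is strictly increasing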
An increase in the score of a single query criterion should increase the combined score, whereas the monotonicity condition dictates such a combined increase only if the scores for all of the query criteria increase simultaneously. A stronger condition (N(x, y) increases even if only x or only y increases) is called the strictly increasing property.² Clearly, min(x, y) is not strictly increasing. Another desirable property for fuzzy conjunction and disjunction operators is distributivity.

The min semantics is known [10,22,23] to be the only semantics for conjunction and disjunction that preserves logical equivalence (in the absence of negation) and is monotone at the same time. This property of the min semantics makes it the preferred fuzzy semantics for most cases. Furthermore, in addition to satisfying the properties of being a t-norm and a t-conorm, the min semantics also has the property of being idempotent. Despite these nice features, the min semantics is not suitable for multimedia applications. As discussed in Example 1.2, according to the min semantics, the scores of the candidate images given in Fig. 2(a) and (c) would be 0.0 even though they partially match the query. Furthermore, the scores of the images in Fig. 2(b) and (d) would both be 0.5, though Fig. 2(d) intuitively has a higher score. The product semantics [20], on the other hand, satisfies idempotency only if a = 0. When a = 1, it has the property of being strictly increasing (when x or y is different from 1) and Archimedean (N(x, x) < x and C(x, x) > x). The Archimedean property is weaker than idempotency, yet it provides an upper bound on the combined score, allowing for optimizations.

3.2. n-ary operator semantics

In information retrieval research (which also shows the characteristics of multimedia applications), other fuzzy semantics, including the arithmetic mean [25], have been suggested. The arithmetic mean semantics (Table 3) provides an n-ary scoring function (|{P_i, ..., P_j}| = n). Note that the binary version of the arithmetic mean does not satisfy the requirements of being a t-norm: it does not satisfy the boundary conditions and it is not associative. Hence, it does not subsume the crisp semantics. On the other hand, it is idempotent and strictly increasing. The arithmetic average semantics emulates the behavior of the dot product-based similarity calculation popular in information retrieval: effectively, each predicate is treated like an independent dimension in an n-dimensional space (where n is the number of predicates), and the merged score is defined by the dot-product distance between the complete truth, <1, 1, ..., 1>, and the given values of the predicates, <μ1(x), ..., μn(x)>.

² This is not the same definition of strictness used in [10].
Table 3
n-ary arithmetic average and geometric average semantics

  Arithmetic average:
    μ_{P_{i1} ∧ ... ∧ P_{in}}(x) = (μ_{i1}(x) + ... + μ_{in}(x)) / n
    μ_{P_{i1} ∨ ... ∨ P_{in}}(x) = 1 − ((1 − μ_{i1}(x)) + ... + (1 − μ_{in}(x))) / n
    μ_{¬P_i}(x) = 1 − μ_i(x)

  Geometric average:
    μ_{P_{i1} ∧ ... ∧ P_{in}}(x) = (μ_{i1}(x) × ... × μ_{in}(x))^(1/n)
    μ_{P_{i1} ∨ ... ∨ P_{in}}(x) = 1 − ((1 − μ_{i1}(x)) × ... × (1 − μ_{in}(x)))^(1/n)
    μ_{¬P_i}(x) = 1 − μ_i(x)

Although this approach is shown to be suitable for many information retrieval applications, it does not capture the semantics of multimedia retrieval applications, introduced in Section 2, which are multiplicative in nature. Therefore, we can use the n-ary geometric average semantics instead. Note that, as was the case for the original product semantics, the geometric average semantics is also not distributive. Therefore, if Q(Y1, ..., Yn) is a query which consists of a set of fuzzy predicates, variables, constants, and conjunction, disjunction, and negation operators, and if Q∨(Y1, ..., Yn) is the disjunctive normal representation of Q, then we define the normal fuzzy semantics of Q(Y1, ..., Yn) as the fuzzy semantics of Q∨(Y1, ..., Yn). In general, μQ ≠ μQ∨. The former semantics would be used when logical equivalence of queries is not expected. The latter, on the other hand, would be used when logical equivalence is required.

3.3. Accounting for partial matches

Both the min and the geometric mean functions have weaknesses in supporting partial matches. When one of the involved predicates returns zero, both of these functions return 0 as the combined score. However, in multimedia retrieval, partial matches are required (see Section 1). In such cases, having a few terms with a score of 0 in a conjunction should not eliminate the whole conjunctive term from consideration.

One proposed [6] way to deal with the partial match requirement is to weigh the different query criteria in such a way that those criteria that are not important for the user are effectively omitted. For instance, in the query given in Fig. 2(Query), if the user knows that spatial information is not important, then the user can choose to assign a lower weight to the spatial constraints. Consequently, using a weighting technique, the image given in Fig. 2(a) can be maintained even though the spatial condition is not satisfied. This approach, however, presupposes that users can identify and weigh the different query criteria. This assumption may not be applicable in many situations, including databases for naive users or retrieval by QBE. Furthermore, it is always possible that for each feature or criterion in the query, there is a set of images in the database that fails it. In such a case, no weighting scheme will be able to handle the partial match requirement for all images.

Example 3.1. Let us assume that a user wants to find all images in the database that are similar to image I_example. Let us also assume that the database uses three features, color, shape, and edge distribution, to compare images, and that the database contains three images, I1, I2, and I3. Finally, let us assume that the following table gives the matching degrees of the images in the database for each feature:
Image    Shape    Color    Edge
I1       0.0      0.9      0.8
I2       0.8      0.0      0.9
I3       0.9      0.8      0.0
According to this table, it is clear that if the user does not specify a priority among the three features, the system should treat all three candidates equally. On the other hand, since for each of the three features there is a different image which fails it completely, even if we have a priori knowledge regarding the feature distribution of the data, we cannot use feature weighing to eliminate low-scoring features.

To account for partial matches, we need to modify the semantics of the n-ary logical operators and eliminate the undesirable nullifying effect of zero-valued predicate scores. Note that a similar modification can also be done for the min semantics. Given a set P = {P1, ..., Pm} of fuzzy sets and F = {μ1(x), ..., μm(x)} of corresponding scoring functions, the semantics of the n-ary fuzzy conjunction operator is as follows:

  μ_{P_1 ∧ ... ∧ P_n}(t) = ( ∏_{μ_k(t) ≥ r_true} μ_k(t) × ∏_{μ_k(t) < r_true} b )^(1/n),

where r_true and b are cut-off parameters.
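The n-ary merges of Table 3, together with a partial-match variant in the spirit of the modification above, can be sketched as follows. The exact treatment of sub-threshold scores (replacing them by a floor value b) is an assumption made for illustration only.

    # A minimal sketch of the n-ary arithmetic and geometric average merges of
    # Table 3, plus a partial-match variant that, as an illustrative assumption,
    # replaces scores below the cut-off r_true by a floor value b so that a
    # single failed predicate does not nullify the whole conjunction.

    from math import prod

    def arithmetic_and(scores):
        return sum(scores) / len(scores)

    def geometric_and(scores):
        return prod(scores) ** (1.0 / len(scores))

    def partial_match_and(scores, r_true=0.4, b=0.4):
        adjusted = [s if s >= r_true else b for s in scores]
        return prod(adjusted) ** (1.0 / len(adjusted))

    scores = [0.0, 0.9, 0.8]         # image I1 of Example 3.1
    print(geometric_and(scores))     # 0.0 -- the zero nullifies the conjunct
    print(partial_match_and(scores)) # > 0 -- the partial match survives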
Fig. 4. Comparison of different scoring mechanisms.
The numbers next to the objects in these candidate images denote the similarity values for the object-level matching. The figure shows the scores of the candidate images as well as their relative ranks. The cut-off parameters used in this example are r_true = 0.4 and b = 0.4, and the structural weights are a1 = 0.8 and a2 = 0.2.

4. Evaluation of queries with unordered fuzzy predicates

In the earlier sections, we have investigated the characteristics of multimedia retrieval and the semantic properties of different fuzzy retrieval options. In this section, we focus on the query processing requirements for multimedia retrieval and provide an efficient algorithm. Although the algorithm is independent of the chosen semantics of the fuzzy logic operators described above, it uses their statistical properties to deal with unknown system parameters.

4.1. Essentials of multimedia query processing

Recently, many researchers have studied query optimization (mostly through algebraic manipulations) in non-traditional forms of databases. Refs. [12,27-30] provide overviews of techniques used for query processing and retrieval in such databases. Solutions in such non-traditional databases vary from the use of database statistics and domain knowledge to facilitate query rewriting and intelligent use of cached information [24], to the use of domain knowledge to discover and prune redundant queries [31]. Li et al. [4] also used off-line feedback mechanisms to prevent
users from asking redundant or irrelevant queries. Chaudhuri and Gravano [15] discuss query optimization issues and ranking in multimedia databases. Chaudhuri and Shim [32] discuss approaches for query processing in the presence of external predicates, or user-defined functions, which are very common in multimedia systems. The above work and others in the literature collectively point to the following essential requirements for multimedia query processing:
· As discussed in earlier sections, fuzziness is inherent in multimedia retrieval for many reasons, including similarity of features, imperfections in the feature extraction algorithms, imperfections in the query formulation methods, partial match requirements, and imperfections in the available index structures.
· Users are usually not interested in a single result, but in k ≥ 1 ranked results, where k is provided by the user. Because of the inherent fuzziness, users want more alternatives from which to choose what they are interested in.
· We would prefer to generate the kth result after we generate the (k − 1)th result, as progressively as possible.
· Since the solution space is large, we cannot perform any processing which would require us to touch or enumerate all solutions.
Fagin [10,11] proposes a set of efficient query execution algorithms for databases with fuzzy queries. These algorithms assume that:
· the query has a monotone increasing combined scoring function;
· individual sources can progressively (in decreasing order of score) output results;
· the user is interested in the best k matches to the query.
If all these conditions hold, then these algorithms can be used to progressively find the best k matches to the given query. Note that, if the min semantics for conjunction is used and the query does not contain negation, then the scoring functions of queries are guaranteed to be monotone increasing. Similarly, for the arithmetic average, product, and geometric average semantics, if the query does not contain negation, then the combined scoring function will be monotone. Consequently, the algorithms proposed by Fagin can be applied.

4.2. Negation

If the query contains negation, on the other hand, then the scoring function may not be monotone increasing, invalidating one of the assumptions. This, however, can be taken care of if we can assume that some of the sources (the negated ones) can also output results in increasing order of score. Since a multimedia predicate is more likely to return lower scores, the execution cost of such a query is expected to be higher. Although the algorithm we introduce in this paper can take negated goals into account when such an index is available, the actual focus of the algorithm is to deal with unordered sub-goals.

4.3. Unordered sub-goals

The second assumption can also be invalid for various reasons, including the binding rules imposed by the predicates.
Example 4.1. For example, consider the following query, which is aimed at retrieving all pairs of images, each containing at least one object, a "mountain" and a "tree", respectively, and that are visually similar to each other:

    select image P1, P2
    where P1.semantical_property s_like "mountain"
    and P2.semantical_property s_like "tree"
    and P1.image_property image_match P2.image_property

The above query contains three fuzzy conditions (two s_like predicates and one image_match predicate). Let us assume that the image_match predicate is implemented as an external function, which can be invoked only by providing two input images. The image_match predicate then returns a score denoting the visual similarity of its inputs. In this case, we have two sources (the s_like predicates) which can output images progressively through database access and one source (image_match) which cannot.

As shown in Fig. 5, because of such non-progressive fuzzy predicates, finding and returning the k best matching results may require a complete scan of the database. Consequently, in order to avoid a complete scan of the database, our algorithm uses the score distribution estimates/statistics of the individual predicates and the statistical properties of the score merging function to compute an approximate set of top-k results.

Chaudhuri and Gravano [15] discuss query optimization issues and ranking in multimedia databases. In their framework, they differentiate between top search (access through an index) and
Fig. 5. The effect of having non-progressive fuzzy predicates (numbers in the figure denote scores of the results): In (a) all predicates are progressive (results are ordered); hence, the results can be merged using the algorithms presented in [11] to find the top-k ranking results. In (b) only two of the three fuzzy predicates are progressive; hence, the merging can be done on two predicates only. However, in this case, the ranking generated by merging two predicates may not be equal to the ranking that would be generated by merging all three of the scores.
probe (testing the predicate for a given object). Top search and probe can be seen as the same as the sorted access and random access described in [11]. In Section 4.4, we provide an algorithm that has a similar search/probe structure. However, unlike the techniques proposed in [15], the algorithm we propose is aimed at dealing with queries with partial matches, is not tied to the min semantics, and can deal with negation as long as the sources (the negated ones) can also output results in increasing order of score.

In order to deal with the non-progressive predicates, the algorithm we propose takes a probability threshold, Θ, as input and returns a set, R, of k results, such that each result r ∈ R is most probably (prob(r ∈ R_k) > 1 − Θ) in the top-k results, R_k. Note that the algorithm uses score distribution function estimates/statistics [4,24] and the statistical properties of the score merging functions for computing approximate top-k results. Note also that the algorithm is flexible in the sense that both strict (such as product) and monotone (such as min) semantics of fuzzy logic operators are acceptable.

4.4. Query evaluation algorithm

Let Q(Y1, ..., Yn) be a query and let P = P_O ∪ P_U be the set of all fuzzy predicates in Q, such that predicates in P_O can output ordered results and those in P_U cannot. Let the query also contain a set, P_C, of crisp (non-fuzzy) predicates, including those which check the equality of variables. Let us also assume that the user is interested in a set R of k results, such that each result r ∈ R is most probably (prob(r ∈ R_k) > 1 − Θ) in the top-k results, R_k. The proposed query execution algorithm is given in Fig. 6. The main inputs to this algorithm are:
· a query, Q,
· a set of ordered predicates, P_O, a set of non-ordered predicates, P_U, and a set of crisp predicates, P_C,
· a positive integer, k,
· an error threshold, Θ,
and the output is a set, R, of k n-tuples. In the following, we provide a detailed description of the algorithm; a simplified sketch of its overall structure is also given below.

The first step of the algorithm calculates the combined scoring function for the query. If this combined scoring function is not monotone (i.e., there are some negated sub-goals), the algorithm identifies the predicates which are negated. For all negated predicates which have an inversely ordered index (if such an index is available), the algorithm will use the results in increasing order of score. For all negated predicates which do not have such an inverse index (which is more likely), the algorithm will treat them as non-progressive predicates.

The second step of the algorithm initializes certain temporary data structures: solNum keeps track of the number of candidate solutions generated, Visited is the set of all tuples generated (candidate solution or not), and Marked is the set of all candidate solutions.

In steps 3(a)-3(d), the algorithm uses an algorithm similar to the one presented in [10,11] to merge results using the ordered predicates (see Fig. 5(b)). The algorithm stops when it finds k n-tuples which satisfy the condition in step 3(e). Note that an n-tuple, s_i, satisfies this condition if and only if
· the probability of having another n-tuple, s_j, with a better score is less than Θ, i.e., when P∃(i) = prob(∃ j > i such that μ_Q(s_i) < μ_Q(s_j)) is at most Θ.
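The following sketch captures only the control structure just described (sorted access on the ordered predicates, probing of the crisp and unordered predicates, and a statistical stopping test). The functions sorted_access, probe_score, and stop_condition are placeholders assumed to be supplied by the system; the sketch is not the algorithm of Fig. 6 itself.

    # A simplified sketch of the evaluation loop, assuming the following helpers
    # exist: sorted_access() yields candidate n-tuples in decreasing order of the
    # combined ordered-predicate score, probe_score() evaluates the crisp and
    # unordered fuzzy predicates for one tuple, and stop_condition() implements
    # the statistical test of step 3(e) for the chosen fuzzy semantics.

    def evaluate_query(sorted_access, probe_score, stop_condition, k, theta):
        visited, marked = [], []
        for i, tup in enumerate(sorted_access(), start=1):
            score = probe_score(tup)          # crisp + unordered fuzzy predicates
            if score is None:                 # a crisp predicate failed
                continue
            visited.append((score, tup))
            if stop_condition(i, theta):      # prob. of a better later tuple <= theta
                marked.append((score, tup))
                if len(marked) >= k:
                    break
        # revisit everything generated so far and return the k best tuples
        visited.sort(key=lambda pair: pair[0], reverse=True)
        return [tup for _, tup in visited[:k]]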
Fig. 6. Query evaluation algorithm.
Let us refer to this condition as F(i, Θ) ≡ (P∃(i) ≤ Θ). Intuitively, if F(i, Θ) is true, then the probability of having another n-tuple, s_j, with a better combined score than the score of s_i is less than Θ. At the end of the third step, Marked contains k n-tuples which satisfy the condition. However, in the fourth step, the algorithm revisits all the n-tuples put into Visited, to see whether there are any better solutions among those that were visited. The best k n-tuples generated during this process are returned as the output, R.

Intuitively, the algorithm uses a technique similar to the ones presented in [10,11] to generate a sequence of results that are ranked with respect to the ordered predicates. As we mentioned above, this order does not necessarily correspond to the final order of the results, because it does not take the unordered scores into account. Therefore, for each tuple generated in the first stage, using the database statistics and the statistical properties of μ_Q, the algorithm estimates the probability of having a better result in the remainder of the database. If, for a given tuple, this probability is below a certain level, then this tuple is said to be a candidate to be in the top-k results.

Note that, for the algorithm to work, we need to be able to calculate F(i, Θ) for the fuzzy semantics chosen for the query. For each different semantics, F(i, Θ) must be calculated in a different way. The following two examples show how to calculate F(i, Θ) for the product and min semantics we covered in Section 3.
Example 4.2. Let us assume that μ_Q is a scoring function that has the product semantics. Let μ_{Q,o} be the combined score of the ordered predicates and let μ_{Q,u} be the combined score of the unordered predicates. Then, given the ith tuple, s_i, ranked with respect to the ordered predicates, F(i, Θ) is equal to

  F(i, Θ) ≡ prob(∃ j > i : μ_{Q,o}(s_j) μ_{Q,u}(s_j) > μ_{Q,o}(s_i) μ_{Q,u}(s_i)) ≤ Θ
          ≡ prob(∀ j > i : μ_{Q,o}(s_j) μ_{Q,u}(s_j) ≤ μ_{Q,o}(s_i) μ_{Q,u}(s_i)) ≥ 1 − Θ
          ≡ prob(∀ j > i : μ_{Q,u}(s_j) ≤ (μ_{Q,o}(s_i) μ_{Q,u}(s_i)) / μ_{Q,o}(s_j)) ≥ 1 − Θ.

Example 4.3. Let us assume that μ_Q is a scoring function that has the min semantics. Then, given the ith tuple, s_i, ranked with respect to the ordered predicates, F_min(i, Θ) ≡ (P∃(i) ≤ Θ) is equal to

  F_min(i, Θ) ≡ prob(∃ j > i : min{μ_{Q,o}(s_j), μ_{Q,u}(s_j)} > min{μ_{Q,o}(s_i), μ_{Q,u}(s_i)}) ≤ Θ
              ≡ prob(∀ j > i : min{μ_{Q,o}(s_j), μ_{Q,u}(s_j)} ≤ min{μ_{Q,o}(s_i), μ_{Q,u}(s_i)}) ≥ 1 − Θ.

Since μ_{Q,o} is a non-increasing function of the rank, μ_{Q,o}(s_j) is smaller than or equal to μ_{Q,o}(s_i). Consequently, if μ_{Q,o}(s_i) ≤ μ_{Q,u}(s_i), then F_min(i, Θ) = true; else

  F_min(i, Θ) ≡ prob(∀ j > i : min{μ_{Q,o}(s_j), μ_{Q,u}(s_j)} ≤ μ_{Q,u}(s_i)) ≥ 1 − Θ.

As seen in the above examples, in order to find the value of F(i, Θ), we need to know the distributions of the scoring functions μ_{Q,o} and μ_{Q,u}. In Section 5.1, we show how these statistical values can be calculated for different fuzzy semantics. However, in some cases, such a score distribution function may not be readily available. In such cases, we need to approximate the score distribution using database statistics. Obviously, such an approximation is likely to cause deviations from the expected results obtained using the algorithm. In Section 5, we describe a method for approximating score distributions.

4.5. Correctness of the algorithm

In this section, we show that the expected ratio of the relevant results within the top-k results returned by the algorithm is within the error bounds, i.e., we prove the correctness of the algorithm.

Theorem 4.1. Given an n-tuple, r_i, which is in the top k n-tuples returned in R, r_i is most probably (prob(r_i ∈ R_k) > 1 − Θ) in the set, R_k, of top-k results.

Proof. Given the ith n-tuple, r_i (i ≤ k) in R, r_i is either in Marked or not.
If r_i ∈ Marked, then it satisfies the condition in step 3(e), and the probability that there is another tuple with a better combined score is less than or equal to Θ. Since i ≤ k, we can conclude that the probability that the subsequent searches will yield k − i + 1 tuples with a better combined score is also less than or equal to Θ. In other words, the probability that the subsequent searches will push r_i out of R_k is less than or equal to Θ. Consequently, the probability that r_i is in R_k is greater than 1 − Θ. If r_i ∉ Marked, then there is an m_i ∈ Marked such that r_i > m_i and m_i satisfies the condition in step 3(e). Consequently, the probability that r_i is in R_k is again greater than 1 − Θ.

Theorem 4.2. The expected number of n-tuples in R that are also in R_k is greater than (1 − Θ)k.

Proof. Given a set, R, of k n-tuples such that each one is in R_k with a worst-case probability 1 − Θ (Theorem 4.1), the number, j, of the elements of R that are also in R_k is a random variable that has a binomial distribution with parameters p > 1 − Θ and q ≤ Θ. Consequently, the expected value of j is greater than (1 − Θ)k.

4.6. Complexity of the algorithm

Note that the main loop of the proposed algorithm can be iterated many times during which no tuples are added to the Marked set. The following theorem states the effect of this on the complexity of the algorithm.

Theorem 4.3. If N is the database size (the number of all possible n-tuples), m is |P_O|, prob(F(i, Θ) = true | 1 ≤ i ≤ N) is geometrically distributed with parameter F, and the predicates are independent, then the expected running time of the algorithm, with arbitrarily high probability, is

  O( N^((m−1)/m) (k/F) + (k/F) log(k/F) ).

Proof. If we assume that the probability of having one n-tuple satisfying the condition in step 3(e) is geometrically distributed with parameter F, the expected number of n-tuples to be tested until the first suitable n-tuple is 1/F. In reality, F is not a constant and it tends to be higher for lower values of i. Consequently, the actual number of visited tuples is lower than the value provided by this assumption. See the results in Section 6 for details. Consequently, for k matches, the expected number of times the loop will be repeated is k/F. Step 3(a) of the algorithm takes O(N^((m−1)/m)) (according to Fagin [10,11]), with arbitrarily high probability, where N is the database size, i.e., the number of all possible n-tuples. Consequently, the expected running time of the while loop (or the expected number of tuples to be evaluated), with arbitrarily high probability, is O(N^((m−1)/m) (k/F)). The final selection of the k best solutions among the candidates then takes O((k/F) log(k/F)) time. Consequently, the expected running time of the algorithm, with arbitrarily high probability, is

  O( N^((m−1)/m) (k/F) + (k/F) log(k/F) ).

Note that when N^((m−1)/m) (k/F) ≪ N, the proposed algorithm accomplishes its task: it visits a very small portion of the database, yet generates approximately good results. This means that if (k/F) ≪ N^(1/m), then the algorithm will work most efficiently. In Section 6, we will show experimental results that verify these expectations. Note that the theorem assumes that prob(F(i, Θ) = true | 1 ≤ i ≤ N) is geometrically distributed. The results presented in Section 6 will show that, if we replace the geometric distribution assumption with a more skewed distribution, the complexity of the algorithm relative to the database size will be less than that predicted by Theorem 4.3. Note also that, similar to the case in [11], though positive correlations between predicates may help reduce the complexity predicted above, negative correlations between predicates may result in higher complexities.

5. Approximating the score distribution using statistics

In Section 4, we have seen that in order to use the proposed query evaluation algorithm, we need to calculate the score distribution for the combined scoring functions, μ_{Q,o} and μ_{Q,u}. In this section, we discuss the differences in score distributions for various fuzzy semantics, followed by our proposed method for approximating the score distribution using statistics.

5.1. Statistical properties of fuzzy semantics

We first investigate the score distributions and statistical properties of different fuzzy semantics. Fig. 7 depicts three mechanisms to evaluate conjunction. Fig. 7(a) depicts the geometric averaging method (which is a product followed by a root operation), Fig. 7(b) depicts the arithmetic averaging mechanism used by other researchers [25], and Fig. 7(c) is the minimum function as described by Zadeh [16] and Fagin [10,11]. In this section, we compare various statistical properties of these semantics. These properties describe the shape of the combined score distribution histograms.
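The checks derived in Examples 4.2 and 4.3 only require the cumulative score distribution of the unordered predicates. As an illustration, the probability appearing in Example 4.2 can be estimated from a histogram as sketched below; the cumulative distribution function, the independence of the remaining tuples, and the use of the next ordered score as a conservative bound are all assumptions made for this sketch.

    # A minimal sketch of the stopping test of Example 4.2 (product semantics).
    # cdf_u(v) is assumed to return prob(mu_Qu(s) <= v), estimated from a score
    # histogram of the unordered predicates; `remaining` is the (estimated)
    # number of tuples still to come, assumed independent of each other.

    def f_product(mu_o_i, mu_u_i, mu_o_next, cdf_u, remaining, theta):
        """True if, with probability >= 1 - theta, no later tuple beats tuple i."""
        if mu_o_next <= 0.0:
            return True                        # no later tuple can score above 0
        # conservative bound: mu_Qo of any later tuple is at most mu_o_next
        bound = (mu_o_i * mu_u_i) / mu_o_next
        p_all_below = cdf_u(min(bound, 1.0)) ** remaining
        return p_all_below >= 1.0 - theta

    # usage with a toy uniform-score assumption: cdf_u(v) = v
    print(f_product(0.9, 0.8, 0.7, lambda v: v, remaining=1000, theta=0.05))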
Fig. 7. The effect of (a) geometric average, (b) arithmetic average, and (c) minimum function with two predicates. Horizontal axes correspond to the values of the two input predicates and the vertical axis corresponds to the value of the conjunct according to the respective function.
5.2. Relative importance of query criteria

An important advantage of the geometric average over the arithmetic average and the min functions is that it shows:
· a linear behavior when the similarity values of the predicates are close to each other,

  (x = y) → dμ(x, y)/dx + dμ(x, y)/dy = x/(2√(x·x)) + x/(2√(x·x)) = 2x/(2x) = 1;

· a non-linear behavior when one of the predicates has lower similarity compared to the others,

  (x ≪ y) → dμ(x, y)/dx = √y/(2√x) ≫ 1,  whereas  (x ≫ y) → dμ(x, y)/dx = √y/(2√x) ≈ 0,

where μ(x, y) = √(xy) denotes the geometric average of the two predicate scores.

Example 5.1. The first item in the following shows the linear increase in the score of the geometric average when the input values are close to each other. The second item, on the other hand, shows the non-linearity of the increase when the input values are different:
· (0.5 × 0.5 × 0.5)^(1/3) = 0.5, (0.6 × 0.6 × 0.6)^(1/3) = 0.6, and (0.7 × 0.7 × 0.7)^(1/3) = 0.7;
· (1.0 × 1.0 × 0.5)^(1/3) ≈ 0.79, (1.0 × 1.0 × 0.6)^(1/3) ≈ 0.85, and (1.0 × 1.0 × 0.7)^(1/3) ≈ 0.88.

It is claimed that, according to real-world and artificial nearest-neighbor workloads, the highest-scoring predicates are interesting and the rest are not [26]. This implies that the min semantics, which gives the highest importance to the lowest-scoring predicate, may not be suitable for real workloads. The geometric average semantics, unlike the min semantics, does not suffer from this behavior. Furthermore, the effect of an increase in the score of a sub-query (due to a modification/relaxation of the query by the user or the system) with a small score value is larger than an equivalent increase in the score of a sub-query with a large score value. This implies that, though the sub-queries with a high score have a larger role in determining the final score, relaxing a non-satisfied sub-query may have a significant impact on improving the final score. This makes sense, as an increase in a low-scoring sub-query increases the interestingness of the sub-query itself.

Average score: The first statistical property that we consider in this section is the average score, which measures, assuming a uniform distribution of input scores, the average output score. The average score, or the relative cardinality, of a fuzzy set with respect to its discourse (or domain) is defined as the cardinality of the set divided by the cardinality of its discourse. We can define the relative cardinality R1 of a fuzzy set S with a scoring function μ(x), where x ranges between 0 and 1, as

  R1 = ∫₀¹ μ(x) dx / ∫₀¹ dx.

Consequently, the average score of the conjunction semantics can be computed as shown in Table 4.

Table 4
Average score of various scoring semantics (two predicates, uniformly distributed inputs)

  Arithmetic average:  ∫₀¹ ∫₀¹ ((x + y)/2) dy dx / ∫₀¹ ∫₀¹ dy dx = 1/2
  Min:                 ∫₀¹ ∫₀¹ min{x, y} dy dx / ∫₀¹ ∫₀¹ dy dx = 1/3
  Geometric average:   ∫₀¹ ∫₀¹ √(xy) dy dx / ∫₀¹ ∫₀¹ dy dx = 4/9

Note that, if analogously defined, the relative cardinality of the crisp conjunction is

  (μ(false ∧ false) + μ(false ∧ true) + μ(true ∧ false) + μ(true ∧ true)) / |{(false ∧ false), (false ∧ true), (true ∧ false), (true ∧ true)}| = (0 + 0 + 0 + 1)/4 = 1/4.

If no score distribution information is available, the average score (assuming a uniform distribution of inputs) can be used to calculate a crude estimate of the ratio of the inputs that have a score greater than a given value. However, a better choice is obviously to look at the expected score distributions of the merge functions.

Score distribution: The second property we investigate is the score distribution. The study of the score distribution of fuzzy algebraic operators is essential in creating the histograms that can be used in the algorithm we proposed. The strong α-cut of a fuzzy set is defined as the set of elements of the discourse that have score values equal to or larger than α. The relative cardinality of a strong α-cut (μ_{x∧y} ≥ α) of a conjunction with respect to its overall cardinality (μ_{x∧y} ≥ 0) describes the concentration of scores above a threshold α. Table 5 and Fig. 8 show the score distribution of various scoring semantics (assuming a uniform distribution of inputs). Note that when α is close to 1, the relative cardinality of the strong α-cut of the geometric average behaves like that of the arithmetic average.

Table 5
Score distribution (relative cardinality of a strong α-cut) of various scoring semantics

  Arithmetic average:  1 − (8α³)/3 for α ≤ 0.5;  4/3 − 4α² + (8α³)/3 for α ≥ 0.5
  Min:                 1 − 3α² + 2α³
  Geometric average:   1 − α³ + 3α³ ln(α)

5.3. Approximation of score distributions

In the previous section, we studied the score distribution of various merge functions. These distributions were generated assuming a uniform distribution of input values and can be used when there is no other information. An alternative approach, on the other hand, is to approximate the score distribution function using database statistics or domain knowledge.
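The entries of Tables 4 and 5 can be checked numerically; the following Monte Carlo sketch assumes uniformly and independently distributed input scores, which is exactly the assumption under which the tables were derived.

    # A Monte Carlo sketch that estimates the average score (Table 4) and the
    # relative cardinality of a strong alpha-cut (Table 5) for the three
    # conjunction semantics, assuming uniform, independent input scores.

    import random

    def estimate(merge, alpha, trials=200_000):
        total, cut_total = 0.0, 0.0
        for _ in range(trials):
            s = merge(random.random(), random.random())
            total += s
            if s >= alpha:
                cut_total += s
        return total / trials, cut_total / total   # (average score, alpha-cut ratio)

    arith = lambda x, y: (x + y) / 2.0
    geom = lambda x, y: (x * y) ** 0.5

    for name, merge in [("arithmetic", arith), ("min", min), ("geometric", geom)]:
        avg, cut = estimate(merge, alpha=0.5)
        print(name, round(avg, 3), round(cut, 3))
    # expected averages: 0.5, 1/3 and 4/9, as in Table 4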
Fig. 8. Score distribution of various score merging semantics (lower curve: min, middle curve: geometric average, upper curve: arithmetic average).
Fig. 9. Various score distributions: (a) uniform distribution, (b) Zipf's distribution, (c) vector-space distribution.
5.3.1. Score distributions
There are various ways in which the scoring function of a predicate can behave. The following is a selection of possible distributions:
Uniform distribution of scores: In this case, the probability that a fuzzy predicate will return a score within the range $[\frac{a}{k}, \frac{a+1}{k})$, where $0 \le a < k$, is $1/k$. Fig. 9(a) shows this behavior.
Zipf's distribution of scores: It is generally the case that multimedia predicates have a small number of high-scoring and a very high number of low-scoring inputs. For instance, according to Zipf's distribution [33,34], the probability that, for a given data element, a fuzzy predicate will return a score within the range $[\frac{a}{k}, \frac{a+1}{k})$, where $0 \le a < k$, is $\frac{1}{(a+1)\ln(1.78k)}$ (Fig. 9(b)).
Vector-space distribution of scores: In information retrieval or multimedia databases, documents or media objects are generally represented as points in an $n$-dimensional space. This is called the vector-model representation of the data [12]. The predicates use the distance (Euclidean, city-block, etc.) between the points in this space to evaluate the dissimilarities of the objects. Therefore, the further apart the objects are from each other, the more dissimilar they are. Hence, given two objects, $o_1$ and $o_2$, in the vector space, their similarity can be measured as
$$sim(o_1, o_2) = 1 - \frac{D(o_1, o_2)}{\max D},$$
where $D(o_1, o_2)$ is the distance between $o_1$ and $o_2$ and $\max D$ is the maximum distance between any two points in the database. Note that $sim(o_1, o_1)$ is equal to 1.0 and the minimum possible score is 0.0. Given a nearest-neighbor query $q$ (a very common type of query that asks for objects that fall near a given query point), we can divide the space into $k$ slices using $k$ spheres, each of which is $\max D / k$ apart from the next.
Fig. 10. The equi-distance spheres enveloping a query point in a three-dimensional space.
Fig. 10 shows three spheres enveloping a query point in a three-dimensional space. In this example, each of these spheres is $\max D / 10$ units apart from the next, i.e., $k$ is equal to 10. Note that in an $n$-dimensional space, the volume of a sphere of radius $r$ is $C r^{n}$, where $C$ is some constant (for example, in a two-dimensional space, the area of a circle is $\pi r^2$ and in a three-dimensional space, the volume of a sphere is $\frac{4}{3}\pi r^3$). Consequently, in an $n$-dimensional space, the volume between two consecutive, $i$th and $(i-1)$th, spheres can be calculated as
$$C\left(i\,\frac{\max D}{k}\right)^{n} - C\left((i-1)\,\frac{\max D}{k}\right)^{n} = C'\left\{ i^{n} - (i-1)^{n} \right\} = O(i^{n-1}).$$
Hence, assuming that the points are uniformly distributed in the space, the expected ratio of the points that fall into this volume to the points that fall into the next larger slice is approximately $(i/(i+1))^{n-1}$. Therefore, if the number of points in the innermost sphere is $I$, then the number of points in
· the second slice is $O(I\,2^{n-1})$,
· the third slice is $O(I\,3^{n-1})$,
· the fourth slice is $O(I\,4^{n-1})$,
and so on. Hence, assuming a uniform distribution of points in the space, the probability that, for a given data element, a fuzzy predicate will return a score within the range $[\frac{a}{k}, \frac{a+1}{k})$, where $0 \le a < k$, is $M(k-a)^{n-1}$, where $M$ is a positive constant and $n$ is the number of dimensions in the vector space. Fig. 9(c) shows this behavior. Note that, when the number of dimensions is higher, the curve becomes steeper.
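The slice argument can also be checked numerically. The sketch below is our own illustration (the ball-sampling shortcut, parameter values, and function name are assumptions, not from the paper): it draws points uniformly inside an $n$-dimensional ball centered on the query point, converts distances to similarity scores with the formula above, and histograms the scores; the bin masses grow roughly like $(k-a)^{n-1}$ toward the low-score end, as derived.

    # Empirical score histogram under the vector-space model:
    # points drawn uniformly inside an n-dimensional ball centered on the query.
    import random

    def score_histogram(dims=3, k=10, n_points=100_000, seed=0):
        rng = random.Random(seed)
        bins = [0] * k
        for _ in range(n_points):
            # For a uniform point in a ball of radius max_d, the distance to the
            # center is distributed as max_d * u**(1/dims) with u ~ Uniform(0, 1).
            distance_ratio = rng.random() ** (1.0 / dims)   # = D(o, q) / max D
            sim = 1.0 - distance_ratio                      # sim = 1 - D / max D
            bins[min(int(sim * k), k - 1)] += 1
        return [count / n_points for count in bins]

    # bins[a] holds the fraction of scores in [a/k, (a+1)/k); under this model it
    # grows roughly like (k - a)**(dims - 1) toward the low-score bins.
    print(score_histogram())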
Fig. 11. Two example queries to an image database.
Clustered score distributions: In the real world, the assumption that points are uniformly distributed across the vector space does not always hold. Fig. 11 provides two example score distributions from an image database, ImageRoadMap. This particular database contains 25,000 images. The retrieval predicate omits those images that are beyond a certain threshold distance from the query point and then appropriately scales the scores to cover the range $[0, 1]$. Therefore, though not shown in the figure, the bin corresponding to a score of 0 is, in actuality, very large. Fig. 11(a) and (b) show a rapid increase in the score distribution, as predicted by the Zipfian and vector-based models. However, the figure shows another phenomenon unpredicted by these models: the distribution starts decreasing after a point instead of continuously increasing. This is due to the fact that points in the vector space are not uniformly distributed; instead, they tend to form clusters. This is especially apparent in Fig. 11(b), where there is not only one, but two local maxima, corresponding to two different clusters. Note, however, that the existence of clusters in the vector space does not mean that the vector-space model described earlier cannot be used to reason about the score distribution: since the volume between slices in the vector space increases as we get further away from the center, slices away from the query point are likely to contain more clusters than the closer ones (assuming that the clusters themselves are uniformly distributed). Therefore, a Zipfian or vector-based model can be used to model an envelope curve for a large vector space with clusters of points (Fig. 12).
Merged score distributions: When cached results and materialized views are used for answering queries or sub-queries, a single predicate in a given query may correspond to a cached combination of multiple sub-predicates put together (in advance) using a merge function. As we have seen in Section 5.1, even when their inputs are uniformly distributed, the score distributions corresponding to such score merge functions show a skewed behavior: the number of low-scoring inputs is much larger than the number of high-scoring inputs.
Fig. 12. A clustered distribution and the enveloping curve.
Summary: Since, in most cases, the number of low-scoring inputs is much larger than the number of high-scoring inputs, in the rest of the paper we will focus our attention on this kind of scoring function. However, this does not mean that scoring functions with other behaviors cannot exist.

5.3.2. Selection of an appropriate scoring function
Since in this section we concentrate on score distributions where the number of low-scoring inputs is much larger than the number of high-scoring inputs, in order to model and approximate $\mu_{Q,o}$ and $\mu_{Q,u}$ we need to find a generic scoring function which has a similar behavior. Clearly, various functions can be used for approximating such scoring functions. We see that two good candidates are
$$f^{a}_{a,b,\phi}(rank) = \frac{b}{rank + a} + \phi \qquad \text{and} \qquad f^{b}_{a,b,\phi}(rank) = b\left(\frac{max\_rank - rank}{max\_rank}\right)^{a} + \phi.$$
Intuitively, in both cases, $f(i)$ gives the $i$th highest score of the predicate. Depending on the values of the $a$, $b$, and $\phi$ parameters, these functions can describe rapidly and slowly decreasing scoring functions. The advantage of the first function over the second one is that it does not need the maximum rank information ($max\_rank$) and it can be evaluated more easily. Note also that the first function can describe a large set of curvatures, both concave and convex (Fig. 13(a), (b), (c), and (d)). Therefore, we use $f^{a}$ in the rest of this paper. Fig. 13(e) shows two possible approximations, each with different parameters, for the combined score of a query of the form $P(X) \wedge Q(Y)$, where $P(X)$ and $Q(Y)$ are ordered and both satisfy Zipf's distribution. This figure shows that the function does not only approximate the predicate scores, but can also approximate combined scores for a query. Therefore, we can conclude that the proposed approximation function (1) is not limited to predicates with a Zipfian distribution and (2) can handle merged scores with non-uniformly distributed inputs.
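For concreteness, here is a small sketch of the first approximation function as reconstructed above, $f_{a,b,\phi}(rank) = b/(rank + a) + \phi$, evaluated with the $\langle a_o, b_o, \phi_o \rangle$ parameters reported later in Section 6.1 for $P(X) \wedge Q(Y)$; the helper name and the sampled ranks are our own choices.

    # The first score-approximation function: f_{a,b,phi}(rank) = b/(rank + a) + phi.
    def f(rank, a, b, phi):
        return b / (rank + a) + phi

    # Parameters quoted in Section 6.1 for P(X) ^ Q(Y) over a 1,000,000-tuple space.
    a_o, b_o, phi_o = 7913.20, 8000.39, 0.0

    for rank in (1, 100, 10_000, 1_000_000):
        print(rank, round(f(rank, a_o, b_o, phi_o), 4))
    # The curve starts close to 1.0 for the best-ranked tuple and decays smoothly,
    # reproducing the skewed profile (few high scores, many low scores) discussed above.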
5.3.3. Calculating $F(i, \Theta)$ using the approximate scoring function
Note that an approximate scoring function, $f_{a,b,\phi}(i)$, for the $i$th highest score is not enough for the algorithm proposed in Section 4.4. Instead, the algorithm needs the value of $F(i, \Theta)$,³ assuming that the behaviors of $\mu_{Q,o}$ and $\mu_{Q,u}$ are approximated with functions $f_{a_o,b_o,\phi_o}(x)$ and $f_{a_u,b_u,\phi_u}(x)$.
Let us assume that $\mu_Q$ is a scoring function that has the product semantics. Let $\mu_{Q,o}$ be the combined score of the ordered predicates and let $\mu_{Q,u}$ be the combined score of the unordered predicates. Let $\mu_{Q,o}(x)$ and $\mu_{Q,u}(x)$ be approximated with functions $f_{a_o,b_o,\phi_o}(x)$ and $f_{a_u,b_u,\phi_u}(x)$.
³ Reminder: If $F(i, \Theta)$ is true, then, given a tuple $s_i$, the probability of having another tuple, $s_j$, with a better combined score than the score of $s_i$ is less than $\Theta$.
Fig. 13. The first score approximation function: two functions (a) and (c) and the corresponding score distributions (b) and (d); (e) approximation of the combined score of a query, $P(X) \wedge Q(Y)$.
Then, given the $i$th tuple, $s_i$, ranked with respect to the ordered predicates in $Q$, $F(i, \Theta) = (P_{\exists}(i) \le \Theta)$ is equal to:
$$prob\big(\forall j > i \;\; \mu_{Q,o}(s_j)\,\mu_{Q,u}(s_j) \le \mu_{Q,o}(s_i)\,\mu_{Q,u}(s_i)\big) \ge 1 - \Theta,$$
$$prob\left(\forall j > i \;\; \mu_{Q,u}(s_j) \le \frac{\mu_{Q,o}(s_i)\,\mu_{Q,u}(s_i)}{\mu_{Q,o}(s_j)}\right) \ge 1 - \Theta,$$
$$prob\left(\forall j > i \;\; \mu_{Q,u}(s_j) \le \frac{\mu_{Q,o}(s_i)\,\mu_{Q,u}(s_i)}{\frac{b_o}{j + a_o} + \phi_o}\right) \ge 1 - \Theta,$$
or, in other words (treating the scores of distinct tuples as independent),
$$\prod_{j=i+1}^{N} prob\left(\mu_{Q,u}(s_j) \le \frac{a_i\,j + b_i}{c\,j + d}\right) \ge 1 - \Theta.$$
Note that $a_i$, $b_i$, $c$ and $d$ are used as shorthands. To simplify the calculations, let us split the above inequality into two parts as follows:
$$F(i, \Theta) = (S_i \ge 1 - \Theta), \qquad \text{where} \quad S_i = \prod_{j=i+1}^{N} prob\left(\mu_{Q,u}(s_j) \le \frac{a_i\,j + b_i}{c\,j + d}\right).$$
Since $\mu_{Q,u}(s_j)$ is the combined score of the unordered predicates, given a rank $j$ calculated with respect to the ordered predicates, we cannot estimate the value of $\mu_{Q,u}(s_j)$. However, using the function $f_{a_u,b_u,\phi_u}$, we can estimate the implicit rank $l_j$ at which $f_{a_u,b_u,\phi_u}(l_j)$ is equal to $\kappa_{i,j} = (a_i\,j + b_i)/(c\,j + d)$ as follows:
$$f_{a_u,b_u,\phi_u}(l_j) = \frac{a_i\,j + b_i}{c\,j + d}, \qquad \frac{b_u}{l_j + a_u} + \phi_u = \frac{a_i\,j + b_i}{c\,j + d},$$
which gives us
$$l_j = \frac{a'_i\,j + b'_i}{c'_i\,j + d'_i}.$$
Note, on the other hand, that, as shown by the shaded region in Fig. 14, the ratio of all tuples, $s_j$, such that $\mu_{Q,u}(s_j) \le \kappa_{i,j}$ is $(N - l_j)/N$. Hence, we have
$$S_i = \prod_{j=i+1}^{N} prob\big(\mu_{Q,u}(s_j) \le \kappa_{i,j}\big) = \prod_{j=i+1}^{N} \frac{N - l_j}{N}.$$
Since this ratio corresponds to a probability, we must make sure that $0.0 \le (N - l_j)/N \le 1.0$. In other words,
$$0.0 \le 1 - \frac{1}{N}\,\frac{a'_i\,j + b'_i}{c'_i\,j + d'_i} \le 1.0,$$
which means that we can find two limits, $l^{\perp}_i$ and $l^{\top}_i$, such that
$$l^{\perp}_i \le j \le l^{\top}_i.$$
Fig. 14. The ratio of the unordered tuples that could satisfy the constraint is $(N - l_j)/N$.
Consequently:
· If $i + 1 \le l^{\perp}_i$, then there will be at least one $i + 1 \le j \le N$ such that $(N - l_j)/N$ will be 0.0. Hence, $S_i = 0.0 < 1 - \Theta$ and, assuming that the error ratio, $\Theta$, is less than 1.0, the condition $F(i, \Theta)$ = false.
· If, on the other hand, $i + 1 \ge l^{\top}_i$, then for all $i + 1 \le j \le N$, $(N - l_j)/N$ will be 1.0. Hence, $S_i = 1.0 \ge 1 - \Theta$ and $F(i, \Theta)$ = true.
· Finally, if neither is the case and $l^{\top}_i < N$, then, since for all $j \ge l^{\top}_i$ we have $(N - l_j)/N = 1.0$,
$$S_i = \prod_{j=i+1}^{N} \frac{N - l_j}{N} = \prod_{j=i+1}^{l^{\top}_i} \frac{N - l_j}{N}.$$
Therefore, $S_i$ can be rewritten as
$$S_i = \prod_{j=i+1}^{lim_i} \frac{N - l_j}{N},$$
where $lim_i = \min\{l^{\top}_i, N\}$. Note that we can further rewrite $S_i$ as
$$S_i = \prod_{j=i+1}^{lim_i} \frac{a''_i\,j + b''_i}{c''_i\,j + d''_i},$$
where $a''_i$, $b''_i$, $c''_i$ and $d''_i$ are used as shorthands.⁴ Since we have guaranteed that the ratio $(a''_i\,j + b''_i)/(c''_i\,j + d''_i)$ is always between 0.0 and 1.0, we can convert the product into a summation by taking the logarithm of both sides of the equation:
$$\ln(S_i) = \sum_{j=i+1}^{lim_i} \ln\left(\frac{a''_i\,j + b''_i}{c''_i\,j + d''_i}\right).$$
Note that, since it is not straightforward to solve the above summation, we can instead replace the equality with two inequalities that bound the value of $\ln(S_i)$ from above and below. These two inequalities are
$$\ln(S_i) \le \int_{i+1}^{lim_i + 1} \ln\left(\frac{a''_i\,j + b''_i}{c''_i\,j + d''_i}\right) dj \qquad \text{and} \qquad \ln(S_i) \ge \int_{i}^{lim_i} \ln\left(\frac{a''_i\,j + b''_i}{c''_i\,j + d''_i}\right) dj.$$
⁴ $S_i$ can also be calculated in closed form as
$$S_i = \left(\frac{a''_i}{c''_i}\right)^{lim_i - i} \frac{\Gamma\!\left(lim_i + 1 + \frac{b''_i}{a''_i}\right)\,\Gamma\!\left(i + 1 + \frac{d''_i}{c''_i}\right)}{\Gamma\!\left(i + 1 + \frac{b''_i}{a''_i}\right)\,\Gamma\!\left(lim_i + 1 + \frac{d''_i}{c''_i}\right)},$$
where $\Gamma(x) = (x - 1)!$. But we will provide an alternative formulation.
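The closed form in the footnote, as reconstructed here, can be checked against the direct product with a few lines of Python; math.lgamma keeps the Gamma ratios numerically stable. The parameter values below are arbitrary test values, not taken from the paper.

    # Check the Gamma-function closed form of S_i against the direct product.
    import math

    def product_direct(a, b, c, d, i, lim):
        s = 1.0
        for j in range(i + 1, lim + 1):
            s *= (a * j + b) / (c * j + d)
        return s

    def product_gamma(a, b, c, d, i, lim):
        # (a/c)^(lim-i) * Gamma(lim+1+b/a) * Gamma(i+1+d/c)
        #              / (Gamma(i+1+b/a) * Gamma(lim+1+d/c))
        log_s = (lim - i) * math.log(a / c)
        log_s += math.lgamma(lim + 1 + b / a) - math.lgamma(i + 1 + b / a)
        log_s += math.lgamma(i + 1 + d / c) - math.lgamma(lim + 1 + d / c)
        return math.exp(log_s)

    a, b, c, d, i, lim = 1.0, 3.0, 1.0, 5.0, 10, 200   # arbitrary test values
    print(product_direct(a, b, c, d, i, lim), product_gamma(a, b, c, d, i, lim))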
Using the second of the two inequalities, we can get
$$S_i \ge \exp\left(\int_{i}^{lim_i} \ln\left(\frac{a''_i\,j + b''_i}{c''_i\,j + d''_i}\right) dj\right) = T_i.$$
Since $F(i, \Theta) = (S_i \ge 1 - \Theta)$ and $S_i \ge T_i$, we can conclude that $F(i, \Theta)$ holds whenever $T_i \ge 1 - \Theta$. Consequently, putting together the different cases encountered so far, we have
· if $i + 1 \le l^{\perp}_i$ then $F(i, \Theta)$ = false;
· else if $i + 1 \ge l^{\top}_i$, $F(i, \Theta)$ = true;
· else $F(i, \Theta) = (T_i \ge 1 - \Theta)$.
Note, therefore, that a tuple $s_i$, ranked $i$th with respect to the ordered predicates and satisfying the boundary conditions, is most probably in the result ($prob(s_i \in R_k) > 1 - \Theta$) if $T_i \ge 1 - \Theta$.
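Putting the pieces together, the following sketch shows one way the test could be evaluated from the reconstructed formulas (product semantics assumed): for each look-ahead rank $j$ it computes $\kappa_{i,j}$, solves for the implicit rank $l_j$, clamps the ratio $(N - l_j)/N$, and accumulates the log-sum directly rather than through the integral bound $T_i$, which would replace the loop. All names, the example parameters, and the simplifications are our own illustrative choices, not the paper's implementation.

    # Illustrative evaluation of F(i, Theta) from the approximations above.
    import math

    def f(rank, a, b, phi):                      # f_{a,b,phi}(rank) = b/(rank + a) + phi
        return b / (rank + a) + phi

    def F(i, theta, combined_i, ord_par, unord_par, N, win=None):
        a_o, b_o, phi_o = ord_par
        a_u, b_u, phi_u = unord_par
        lim = min(i + win, N) if win else N      # optional look-ahead limit (the win parameter)
        log_S = 0.0
        for j in range(i + 1, lim + 1):
            kappa = combined_i / f(j, a_o, b_o, phi_o)
            if kappa >= f(1, a_u, b_u, phi_u):   # even the best unordered score stays below kappa
                continue                         # the ratio is 1.0 and contributes nothing
            if kappa <= phi_u:                   # f_u never falls to kappa; the ratio goes to 0
                return False
            l_j = b_u / (kappa - phi_u) - a_u    # implicit rank with f_u(l_j) = kappa
            ratio = max(0.0, min(1.0, (N - l_j) / N))
            if ratio == 0.0:
                return False
            log_S += math.log(ratio)
        return math.exp(log_S) >= 1.0 - theta

    print(F(i=50, theta=0.2, combined_i=0.25,
            ord_par=(7913.20, 8000.39, 0.0),
            unord_par=(197315.00, 207807.59, 0.0),
            N=1_000_000, win=5))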
Final note: If the approximate score function overestimates $\mu_{Q,u}$, then the value of $lim_i$ (the limit rank below which the probability of finding a larger combined score goes to 0) is also overestimated. An important consequence of this overestimation, however, is that $F(i, \Theta)$ may evaluate to false in more situations than necessary. Therefore, in order to balance such overestimations, we can introduce a new parameter, $win$, and set $lim_i$ to $i + win$. However, since the approximation function given in the previous section is flexible enough to fit various situations, we believe that this parameter will not be necessary in most cases. Nevertheless, in the experiments section, we experimented with different values of the $win$ parameter as well.

6. Experimental evaluation
We have conducted a set of experiments to evaluate the algorithm we proposed to reduce the number of tuples searched when looking for the best $k$ matches to a fuzzy query. The primary goal of the experiments was to see whether (a) the proposed algorithm returns the expected percentage of the results while (b) exploring only a small fraction of the entire search space. These experiments were also aimed at investigating whether (c) the proposed approach of approximating predicates through their statistical properties affects the solutions obtained by the algorithm negatively. In this section, we report on the observations we obtained through simulations.
6.1. Experiment setup
In our experiments, we have varied (1) the number of top results required, $k$, (2) the database size (or the search space size), and (3) the error threshold. The query that we used for the simulations contains two ordered predicates, $P(X)$ and $Q(Y)$, and one unordered predicate, $R(X, Y)$. We have ranged the sizes of the predicates $P$ and $Q$ between 100 and 1000. Note that this means that there are up to 1,000,000 tuples in the database over $\langle X, Y \rangle$. This is similar to the case where there are 1,000,000 images in the database and we are using features with 1000 different values for retrieval. We ran the experiments on a Linux platform with a 300 MHz Pentium machine with 128 MB main memory. Each experiment was run 20 times and the results were averaged.
We have generated the fuzzy values for the predicate scores according to Zipf's distribution [33,34]. This distribution guarantees that the number of high scores returned by a predicate is less than the number of low scores, fitting the profile of many multimedia predicates. More specifically, we set the probability that, for a given data element, a fuzzy predicate will return a score within the range $[\frac{a}{10}, \frac{a+1}{10})$, where $0 \le a < 10$, to $\frac{1}{(a+1)\ln(1.78 \times 10)}$. We used the product semantics as the fuzzy semantics for retrieval. We have approximated $\mu_{Q,o}$ and $\mu_{Q,u}$ with functions $f_{a_o,b_o,\phi_o}(x)$ and $f_{a_u,b_u,\phi_u}(x)$. For instance, for the case when the size of the database is 1,000,000, the corresponding approximation parameters are chosen as $\langle a_o = 7913.20,\ b_o = 8000.39,\ \phi_o = 0.0 \rangle$ (for $P(X) \wedge Q(Y)$, as shown in Fig. 13(e)) and $\langle a_u = 197315.00,\ b_u = 207807.59,\ \phi_u = 0.0 \rangle$ (for $R(X, Y)$). Note that the parameters are chosen such that the approximation for $R(X, Y)$ overestimates the scores (Fig. 15). To account for this overestimation, we varied $win$ between 1 and 5.
Fig. 15. The distribution of $R(X, Y)$ and its overestimating approximation.
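The setup is easy to reproduce in outline. The sketch below is our own code (names, sizes, and the seed are arbitrary, and the unordered predicate $R(X, Y)$ is omitted for brevity): it draws predicate scores for $P$ and $Q$ from the binned Zipf distribution quoted above, combines them with the product semantics, and extracts the true top-$k$ by brute force, which is the baseline used for comparison.

    # Skeleton of the simulation setup: Zipf-distributed predicate scores,
    # product semantics, and a brute-force top-k baseline.
    import heapq
    import math
    import random

    def zipf_score(rng, k=10):
        # P(score in [a/k, (a+1)/k)) = 1 / ((a + 1) * ln(1.78 * k)), 0 <= a < k.
        weights = [1.0 / ((a + 1) * math.log(1.78 * k)) for a in range(k)]
        a = rng.choices(range(k), weights=weights)[0]
        return rng.uniform(a / k, (a + 1) / k)

    rng = random.Random(42)
    size = 300                                    # predicate size (100-1000 in the paper)
    P = [zipf_score(rng) for _ in range(size)]    # scores of P(X)
    Q = [zipf_score(rng) for _ in range(size)]    # scores of Q(Y)

    # Product semantics over all (x, y) tuples; brute-force top-k baseline.
    k_results = 20
    top_k = heapq.nlargest(k_results, ((P[x] * Q[y], x, y)
                                       for x in range(size) for y in range(size)))
    print(top_k[:3])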
For comparison purposes, we have also implemented a brute-force algorithm which explores the entire search space and returns the actual rank of each element in the database with respect to a given query. In Section 6.2, we describe our observations.

6.2. Experiment results
The initial set of experiments showed that, as expected, due to the overestimation of $\mu_{Q,u}$ (Fig. 15), although the algorithm found all of the real top-k data elements, it performed significantly more comparisons than required and, in the worst case, degenerated to checking all tuples. This condition, however, gave us a reasonable framework in which to study the effect of correcting imperfect approximations through the use of the $win$ parameter, which limits the look-ahead.
Fig. 16(a) and (b) show the percentage of real top-k results among the tuples returned by the algorithm for two different expected percentages: (a) 20% and (b) 80% (or 0.8 and 0.2 error thresholds, respectively). For the first case we used $win = 2$ and for the second case we used $win = 5$. The reason why, in this particular figure, we are using a larger $win$ value for the smaller error threshold is that, when we want smaller errors, the algorithm must make its estimations using more information. Results for the other parameter assignments are provided in Appendix A. Fig. 16 shows that, as the value of $k$ increases, the algorithm performs better, in the sense that it returns closer to the expected ratio of real top-k results.
Fig. 16. (a) and (c) Percentage of real top-k results among the tuples returned by the algorithm, and (b) and (d) the number of tuples enumerated (figures are plotted to the $\sqrt{\text{database size}}$ scale).
Note, on the other hand, that for the 20% case, when the database is small, the effect is a higher number of top-k results (which means that the algorithm visits more tuples than it needs to for the percentage provided by the user). Note that in both cases, when $k$ and the database size are sufficiently large (50 and 5000, respectively), the algorithm provides at least the expected ratio of top-k results.
Fig. 16(c) and (d) show the number of tuples visited by the algorithm. According to these figures, the proposed algorithm works very efficiently. For instance, when the database size is 1,000,000, $k$ is 100, and the expected ratio of top-k matches is 80%, the algorithm visits around 2000 tuples, i.e., 0.2% of the database. Note that Theorem 4.3 implies that the number of tuples visited should be
$$O\left(\frac{N^{(m-1)/m}\,k}{F}\right),$$
where $N$ is the database size, $m$ the number of ordered predicates (2 in our case), $k$ the number of results, and $F = prob(F(i, \Theta) = \text{true} \mid 1 \le i \le N)$. According to this, the number of tuples visited must increase linearly with $k$. This expectation is confirmed by both figures. Again according to the theorem, since $m = 2$, for a given $k$ and assuming that $F$ is a constant, the number of visited tuples must be $O(\sqrt{N})$. In the simulation, we have actually seen an even smaller increase in the number of tuples with increasing database size (note that the figures are plotted to the $\sqrt{\text{database size}}$ scale to provide a better visualization of the results). This is because the theorem gives an upper bound on the number of tuples visited: the proof assumes that the $n$-tuples satisfying the condition $F(i, \Theta)$ are geometrically distributed with parameter $F$. In the simulations, however, since we used Zipf's distribution for the predicates, the actual distribution of $F$ is not geometric. Consequently, the algorithm finds the matches much earlier than the upper bound suggests. A complete set of experiment results with various parameter settings is given in Appendix A. The results presented in Appendix A mimic the sample of results presented in this section.
7. Comparison with the related work
In addition to the various related work we mention along with the presentation of our approach, a notable recent work by Donjerkovic and Ramakrishnan aims at optimizing a ``top-k results'' type of query probabilistically, by viewing the selectivity estimates as probability distributions. However, unlike our approach, it does not address fuzzy queries, but exact-match queries on data that have an inherent order (such as salaries). Also, instead of allowing a limited error on the results themselves, it uses the selectivity estimates to optimize the cost of exact retrieval of the first $k$ results by providing probabilistic guarantees on the optimization cost. Also recently, Chaudhuri and Gravano [15] considered the problem of processing top-k selection queries effectively; their work has a different focus from ours, as it concentrates on providing a way to convert top-k queries into regular database queries. A more relevant work is by Acharya et al., which proposes algorithms aimed at providing approximate answers to warehouse queries using statistics about the database content. Unlike our approach, it does not provide mechanisms to deal with similarity-based query processing, and it does not address the problem of finding top-k results incrementally.
Fig. 17. Number of tuples enumerated by the algorithm.
8. Conclusion
In this paper, we have first presented the difference between the general fuzzy query and multimedia query evaluation problems. More specifically, we have pointed to the multimedia precision/recall semantics, the partial match requirement, and the unavoidable necessity of fuzzy but non-progressive predicates.
Fig. 18. Number of tuples enumerated by the algorithm.
Next, we have presented an approximate query evaluation algorithm that builds on [10,11] to address the existence of non-progressive fuzzy predicates. The proposed algorithm returns a set $R$ of $k$ results, such that each result $r \in R$ is most probably in the set, $R_k$, of top-k results of the query.
Fig. 19. Number of tuples enumerated by the algorithm.
This algorithm uses the statistical properties of the fuzzy predicates as well as of the merge functions used to combine the fuzzy values returned by the individual predicates. It minimizes the unnecessary accesses to non-progressive predicates, while providing error bounds on the top-k retrieval results. Since the performance of the algorithm depends on the accuracy of the database statistics, we have discussed techniques to generate and maintain the relevant statistics. Finally, we presented simulation results for evaluating the proposed algorithm in terms of quality of results and search space reduction.

9. For further reading
[7,8].

Acknowledgements
We thank Dr. Golshani and Y.-C. Park for providing us with data distributions from their image database, ImageRoadMap, which we used for evaluating the score approximations.
Fig. 20. Percentage of real top-k results among the tuples returned by the algorithm.
Fig. 21. Percentage of real top-k results among the tuples returned by the algorithm.
Fig. 22. Percentage of real top-k results among the tuples returned by the algorithm.
Appendix A
See Figs. 17–22.
References
[1] S.Y. Lee, M.K. Shan, W.P. Yang, Similarity retrieval of ICONIC image database systems, Pattern Recognition 22 (6) (1989) 675–682.
[2] A. Prasad Sistla, Clement Yu, Chengwen Liu, King Liu, Similarity-based retrieval of pictures using indices on spatial relationships, in: Proceedings of the 1995 VLDB Conference, Zurich, Switzerland, 23–25 September 1995.
[3] Wen-Syan Li, K. Selçuk Candan, SEMCOG: a hybrid object-based image database system and its modeling, language, and query processing, in: Proceedings of the 14th International Conference on Data Engineering, Orlando, FL, USA, February 1998.
[4] Wen-Syan Li, K. Selçuk Candan, Kyoji Hirata, Yoshinori Hara, Facilitating multimedia database exploration through visual interfaces and perpetual query reformulations, in: Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB), Athens, Greece, August 1997, pp. 538–547.
[5] Wen-Syan Li, K. Selçuk Candan, K. Hirata, Y. Hara, Hierarchical image modeling for object-based media retrieval, Data and Knowledge Engineering 27 (2) (1998) 139–176.
[6] R. Fagin, E.L. Wimmers, Incorporating user preferences in multimedia queries, in: F. Afrati, P. Kolaitis (Eds.), Database Theory – ICDT '97, LNCS, vol. 1186, Springer, Berlin, Germany, 1997, pp. 247–261.
[7] R. Fagin, Y.S. Maarek, Allowing users to weight search terms, Technical Report RJ10108, IBM Almaden Research Center, San Jose, CA, USA, 1998.
[8] S.Y. Sung, A linear transform scheme for combining weights into scores, Technical Report TR98-327, Rice University, Houston, TX, USA, 1998.
[9] S. Adali, P.A. Bonatti, M.L. Sapino, V.S. Subrahmanian, A multi-similarity algebra, in: Proceedings of the 1998 ACM SIGMOD Conference, Seattle, WA, USA, June 1998, pp. 402–413.
[10] R. Fagin, Fuzzy queries in multimedia database systems, in: The 17th ACM Symposium on Principles of Database Systems, June 1998, pp. 1–10.
[11] R. Fagin, Combining fuzzy information from multiple systems, in: The 15th ACM Symposium on Principles of Database Systems, 1996, pp. 216–226.
[12] C. Faloutsos, Searching Multimedia Databases by Content, Kluwer Academic Publishers, Boston, 1996.
[13] R. Richardson, A. Smeaton, J. Murphy, Using WordNet as a knowledge base for measuring conceptual similarity between words, in: Proceedings of the Artificial Intelligence and Cognitive Science Conference, Trinity College, Dublin, 1994.
[14] Weining Zhang, Clement Yu, Bryan Reagan, Hiroshi Nakajima, Context-dependent interpretations of linguistic terms in fuzzy relational databases, in: Proceedings of the 11th International Conference on Data Engineering, Taipei, Taiwan, March 1995 [IEEE].
[15] S. Chaudhuri, L. Gravano, Optimizing queries over multimedia repositories, in: Proceedings of the 1996 ACM SIGMOD Conference, Montreal, Canada, June 1996, pp. 91–102.
[16] L. Zadeh, Fuzzy sets, in: Information and Control, 1965, pp. 338–353.
[17] H. Nakajima, Development of efficient fuzzy SQL for large scale fuzzy relational database, in: Proceedings of the Fifth International Fuzzy Systems Association World Conference, 1993, pp. 517–520.
[18] V.S. Lakshmanan, N. Leone, R. Ross, V.S. Subrahmanian, ProbView: a flexible probabilistic database system, ACM Transactions on Database Systems 22 (3) (1997) 419–469.
[19] H. Bandemer, S. Gottwald, Fuzzy Sets, Fuzzy Logic, Fuzzy Methods with Applications, Wiley, Chichester, UK, 1995.
[20] U. Thole, H.-J. Zimmermann, P. Zysno, On the suitability of minimum and product operators for the intersection of fuzzy sets, Fuzzy Sets and Systems (1979) 167–180.
[21] J. Yen, Fuzzy logic – a modern perspective, IEEE Transactions on Knowledge and Data Engineering 11 (1) (1999) 153–165.
[22] D. Dubois, H. Prade, Criteria aggregation and ranking of alternatives in the framework of fuzzy set theory, Fuzzy Sets and Decision Analysis, TIMS Studies in Management Sciences 20 (1984) 209–240.
[23] R.R. Yager, Some procedures for selecting fuzzy set-theoretic operations, International Journal of General Systems (1965) 115–124.
[24] S. Adali, K.S. Candan, Y. Papakonstantinou, V.S. Subrahmanian, Query caching and optimization in distributed mediator systems, in: Proceedings of the 1996 ACM SIGMOD Conference, Montreal, Canada, June 1996, pp. 137–147.
[25] Y. Alp Aslandogan, Chuck Thier, Clement Yu, Chengwen Liu, Krishnakumar R. Nair, Design, implementation and evaluation of SCORE, in: Proceedings of the 11th International Conference on Data Engineering, Taipei, Taiwan, March 1995 [IEEE].
[26] K. Beyer, J. Goldstein, R. Ramakrishnan, U. Shaft, When is nearest neighbor meaningful? in: Database Theory – ICDT '99, Springer, Berlin, Germany, 1999 [to appear].
[27] C.T. Yu, W. Meng, Principles of Database Query Processing for Advanced Applications, Morgan Kaufmann, Los Altos, CA, 1998.
[28] A. Yoshitaka, T. Ichikawa, A survey on content-based retrieval for multimedia databases, IEEE Transactions on Knowledge and Data Engineering 11 (1) (1999) 81–93.
[29] F. Idris, S. Panchanathan, Review of image and video indexing techniques (special issue on Indexing, Storage and Retrieval of Images and Video – Part II), Journal of Visual Communication and Image Representation 8 (2) (1997) 146–166.
[30] Y.A. Aslandogan, C.T. Yu, Techniques and systems for image and video retrieval, IEEE Transactions on Knowledge and Data Engineering 11 (1) (1999) 56–63.
[31] O. Etzioni, K. Golden, D. Weld, Sound and efficient closed-world reasoning for planning, Artificial Intelligence 89 (1–2) (1997) 113–148.
[32] S. Chaudhuri, K. Shim, Optimization of queries with user-defined predicates, in: VLDB'96, 1996, pp. 87–98.
[33] G.K. Zipf, Relative frequency as a determinant of phonetic change, Harvard Studies in Classical Philology, 1929.
[34] Lee Breslau, Pei Cao, Li Fan, Graham Phillips, Scott Shenker, On the implications of Zipf's law for web caching, in: INFOCOM '99, New York, USA, March 1999.
Kasım Selçuk Candan is a tenure-track assistant professor at the Department of Computer Science and Engineering at Arizona State University. He joined the department in August 1997, after receiving his Ph.D. from the Computer Science Department at the University of Maryland at College Park. His dissertation research concentrated on multimedia document authoring, presentation, and retrieval in distributed collaborative environments. He received the 1997 ACM DC Chapter award of the Samuel N. Alexander Fellowship for his Ph.D. work. His research interests include the development of formal models, indexing schemes, and retrieval algorithms for multimedia and Web information, and the development of novel query optimization and processing algorithms. He has published various articles in respected journals and conferences in related areas. He received his B.S. degree, ranked first in the department, in computer science from Bilkent University in Turkey in 1993.

Wen-Syan Li is currently a research staff member at the Computers & Communications Research Laboratories (CCRL), NEC USA Inc. He received his Ph.D. in Computer Science from Northwestern University in December 1995. He also holds an MBA degree. His main research interests include content delivery networks, multimedia/hypermedia/document databases, the WWW, and Internet/intranet search engines.

Lakshmi Priya Mahalingam completed her Masters in Computer Science (Databases) at Arizona State University in 1999. Her masters thesis topic is ``Query Optimization in Multimedia Databases'' under Dr. Kasım Selçuk Candan. She is currently working for Thomson Research, Boston, developing database applications. Her research interests include query optimization, databases, and algorithms. She received her B.S. in India in 1998. She was ranked first in the university and received a Gold Medal for the same.