Engineering efficient metric indexes


Pattern Recognition Letters 28 (2007) 75–84 www.elsevier.com/locate/patrec

Kimmo Fredriksson *

Department of Computer Science, University of Joensuu, P.O. Box 111, 80101 Joensuu, Finland

Received 14 December 2004; received in revised form 23 February 2006; available online 8 August 2006
Communicated by T.K. Ho

Abstract

We give efficient algorithms and index data structures for range search in general metric spaces. We give a simple method to make almost any existing algorithm memory adaptive, improving the search the more memory is available. For vector spaces and the metric space of strings we show how several distances can be computed bit-parallelly on a sequential computer, and we use the result to improve the search performance. This method works especially well with approximate range queries. The experimental results show that we can obtain order of magnitude improvements over the existing methods.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Algorithms; Data structures; Information retrieval; Metric space indexing; Proximity searching; Bit-parallel distance evaluations; Memory adaptiveness

1. Introduction

Similarity searching has a vast number of applications in numerous fields, such as audio and image databases, image quantization and compression, text or document databases, computational biology, and data mining, to name a few. A common aspect of all the applications is that we have a universe U of objects, and a non-negative distance function d : U × U → R⁺. The distance function is a metric if it satisfies, for all x, y, z ∈ U,

d(x, y) = 0 ⟺ x = y,
d(x, y) = d(y, x),
d(x, y) ≤ d(x, z) + d(z, y).

The last item is called the ''triangle inequality'', and is the most important property in our case. The smaller the distance, the more similar the two objects are. Moreover, the distance function is usually (very) expensive to compute.
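For concreteness, the triangle inequality yields the pruning rule used by all the indexes discussed below: for any pivot p, d(q, u) ≥ |d(q, p) − d(p, u)|, so an object u can be discarded without ever evaluating d(q, u) whenever |d(q, p) − d(p, u)| > r. A minimal C++ sketch of this test (illustrative only; not code from the article):

    #include <cmath>

    // Pivot-based pruning: by the triangle inequality,
    //   d(q, u) >= |d(q, p) - d(p, u)|,
    // so if that lower bound already exceeds the query radius r, the object u
    // cannot be in the range and no distance evaluation d(q, u) is needed.
    inline bool can_discard(double d_qp, double d_pu, double r) {
        return std::fabs(d_qp - d_pu) > r;
    }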

* Fax: +358 13 251 7955. E-mail address: [email protected].fi

0167-8655/$ - see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.patrec.2006.06.012

Our database S is a finite subset of that universe, i.e. S ⊆ U. The size of S is |S| = n. S is preprocessed in order to efficiently answer range queries. Given a query object q, we retrieve all objects in S that are close enough to q, i.e. we retrieve the set {u ∈ S | d(q, u) ≤ r} for some user-supplied r. Metric space indexing is a general technique to process range queries efficiently, and a large number of different data structures and query algorithms have been proposed; see the excellent survey articles by Chávez et al. (2001) and Hjaltason and Samet (2003) for references. Usually when comparing different metric indexing algorithms one counts just the number of distance evaluations needed for the range query. In addition, there is often some associated ''extra CPU time'' that depends on the particularities of the index, but this is usually considered to be negligible, since the distance computations are assumed to dominate the total time. However, this is not always so.

In this paper we propose a simple way to make most of the existing indexes memory adaptive. This is achieved by plugging AESA (Vidal, 1986) into algorithms that use hierarchical partitioning of the space.


We also show how backtracking in those hierarchies can be reduced by making the partitioning trees superlinear in size. Other more specific methods are also developed. We also consider evaluating several distances in parallel on a sequential computer, using bit-parallelism or SIMD extensions. This gives huge speed-ups for range queries in vector spaces and in the metric space of strings. In addition, this method works extremely well when applied with an approximate search strategy (Chávez and Navarro, 2003). We conclude with experimental results that show that over an order of magnitude speed-ups can be obtained. Some of the ideas were sketched in (Fredriksson, 2004).

2. Previous work

We now briefly describe the indexes that are relevant to this paper. Many other indexing schemes could be used as well, and many of the improvements we propose could be applied to most of those other structures equally well.

2.1. Burkhard–Keller tree (BKT) (Burkhard and Keller, 1973)

The hierarchy is built as follows. Some arbitrary object p ∈ S is chosen for the root of the tree. The object p is called a pivot. The distances between p and the objects in S \ {p} are then evaluated. For each distinct (assuming discrete distances) distance e the root node will have a child that contains the objects that are at distance e from the object in the root node, i.e. the child number e is recursively built using the set S_e = {u ∈ S | d(p, u) = e}. This is repeated until only one, or in general b (for a bucket), objects are left, which are stored into the leaves of the tree. The tree has O(n/b) nodes, where n = |S|, and the construction requires O(nk) distance computations, where k is the height of the tree. The search with the query object q and range r first evaluates the distance d(q, p), where p is the object in the root of the tree. If d(q, p) ≤ r, then p is put into the output list. The search then recursively enters each child e such that d(q, p) − r ≤ e ≤ d(q, p) + r. Whenever the search reaches a leaf, the objects stored in the bucket are directly compared with q. The triangle inequality ensures that no relevant object is missed. The search requires O(n^α) distance computations on average, where 0 < α < 1.
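For concreteness, a minimal sketch of a BKT with bucket size b = 1 follows (illustrative C++; the object type and the distance function are placeholders, not code from the article):

    // Minimal BKT sketch (integer-valued metric, bucket size b = 1).
    #include <map>
    #include <vector>
    #include <string>
    #include <functional>

    struct BKTNode {
        std::string pivot;                 // the pivot object p of this node
        std::map<int, BKTNode*> children;  // one child per distinct e = d(p, u)
    };

    using Dist = std::function<int(const std::string&, const std::string&)>;

    BKTNode* build(std::vector<std::string> objs, const Dist& d) {
        if (objs.empty()) return nullptr;
        BKTNode* node = new BKTNode{objs.back(), {}};
        objs.pop_back();
        std::map<int, std::vector<std::string>> groups;
        for (const auto& u : objs) groups[d(node->pivot, u)].push_back(u);
        for (auto& [e, group] : groups) node->children[e] = build(group, d);
        return node;
    }

    void range_search(const BKTNode* node, const std::string& q, int r,
                      const Dist& d, std::vector<std::string>& out) {
        if (!node) return;
        int e = d(q, node->pivot);
        if (e <= r) out.push_back(node->pivot);
        // Only children labeled within [e - r, e + r] can contain matches.
        for (const auto& [label, child] : node->children)
            if (label >= e - r && label <= e + r)
                range_search(child, q, r, d, out);
    }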

2.2. List of Clusters (LC) (Chávez and Navarro, 2005)

LC selects a random pivot c, called a center, and a covering radius cr(c) for the center. The center and its covering radius define a zone, and cr(c) is the maximum distance from c to any other object in the zone. A parameter h defines the number of objects in each zone. The list is built as follows. The first center is chosen at random. Then its h − 1 nearest neighbors are selected, and cr(c) is the distance to the (h − 1)th neighbor. The zone is then c and its h − 1 nearest neighbors. The set of these objects is called I, and the rest of the list is recursively built for E = S \ I. The next center selected is the one that maximizes the sum of distances to all previous centers. The search evaluates the distance e = d(q, c), and if e ≤ r the center c is reported. If e ≤ cr(c) + r, i.e. the query intersects the zone, the bucket of h − 1 objects in I is searched exhaustively. If e > cr(c) − r, i.e. the query is not fully contained in I, then E (the rest of the list) is searched recursively. The performance of LC depends heavily on the parameter h. As shown in (Chávez and Navarro, 2005), h should be the smaller the larger the dimension or the search radius is (e.g. in Chávez and Navarro, 2000 the authors use h = √n). See the more detailed discussion in Section 3.1.

2.3. Vantage-Point Tree (VPT) (Yianilos, 1993)

VPT (also known as ''Metric Tree'', Uhlmann, 1991) is basically a balanced binary tree version of LC, up to pivot selection techniques (although historically VPT appeared before LC). That is, I and E each contain half of the objects, and both are built recursively. See Section 3.1 for more details. There are many variants of VPT, see Chávez et al. (2001) for references.

2.4. Approximating and Eliminating Search Algorithm (AESA) (Vidal, 1986)

AESA is an extreme case of pivot based algorithms. The data structure is simply a precomputed matrix of all the n(n − 1)/2 distances between the n objects in S. The space complexity is therefore O(n^2) and the matrix is computed with O(n^2) distance computations. This makes the structure highly impractical for large n. The range search is equally simple. First a random object (pivot) p is selected, and the distance e = d(q, p) is evaluated. Then each element u ∈ S that does not satisfy e − r ≤ d(u, p) ≤ e + r is eliminated, i.e. we compute a new set S′ = {u ∈ S \ {p} | e − r ≤ d(u, p) ≤ e + r}. The distances d(u, p) are retrieved from the precomputed matrix. However, the elimination process has to make a linear scan over the set S, so the cost is the time for one distance computation plus O(n). This process is repeated with a new pivot p taken from the qualifying set S′, until only b objects are left, which are directly compared. By experimental results (Vidal, 1986) the search algorithm makes only a constant number of distance computations on average, which means O(n) extra CPU time on average.
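A sketch of the AESA elimination loop follows (illustrative only; the pivot choice is simplified to taking an arbitrary element of the qualifying set, and the matrix layout is a placeholder):

    // AESA-style range search over a precomputed distance matrix D,
    // where D[i][j] = d(object i, object j). Illustrative sketch only.
    #include <vector>
    #include <functional>

    std::vector<int> aesa_range(const std::vector<std::vector<double>>& D,
                                const std::function<double(int)>& dist_to_query,
                                double r) {
        int n = (int)D.size();
        std::vector<int> alive(n);
        for (int i = 0; i < n; ++i) alive[i] = i;
        std::vector<int> result;
        while (!alive.empty()) {
            int p = alive.back();            // next pivot from the qualifying set
            alive.pop_back();
            double e = dist_to_query(p);     // the only real distance evaluation
            if (e <= r) result.push_back(p);
            std::vector<int> next;
            for (int u : alive)              // O(n) pass using stored distances
                if (e - r <= D[p][u] && D[p][u] <= e + r)
                    next.push_back(u);
            alive.swap(next);
        }
        return result;
    }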


3. Our algorithms

3.1. Hierarchy of Clusters

We convert the List of Clusters into a Hierarchy of Clusters (HC). That is, we build an (optionally balanced) binary tree instead of a list. Only two modifications are needed. First, we recursively build the ''list'' for the set I as well, and not just for E. This turns the list into a binary tree. Second, we use a function h(n) to control the balance of the tree. For each node with n objects, we put h(n) objects into the I branch and n − h(n) objects into the E branch. A balanced VPT type tree is obtained with h(n) = n/2. The search algorithm is the same as for LC, except that if we must search I, we do not do it exhaustively, but recurse into the I subtree as in VPT. As in the case of BKT, we may cut the tree construction if there are at most b objects left, and store these as a bucket. In both BKT and HC using b > 1 only saves (a small amount of) space; the exhaustive search is always slower compared to fully evaluating the tree (b = 1). The construction takes O(n log n) distance evaluations if the tree is balanced, but this moves towards O(n^2) as the degree of unbalance increases. We call this structure a Hierarchy of Clusters, HC for short, or UHC when we want to make the unbalance explicit. As experimentally shown in Section 5, the optimal tree is certainly not balanced.

The search algorithm always enters at least one of the two subtrees (and sometimes both). We therefore define an additional radius cr′(c) for each center c, which is the largest distance to the objects that are descendants of c (i.e. the objects in the I and E branches). If at query time d(c, q) > cr′(c) + r, then we can stop the recursion, as all the objects are outside the query ball. This means that we do not have to enter either of the branches. Note that the opposite may also be true, i.e. the query ball may totally cover the I zone (i.e. d(c, q) + cr(c) ≤ r) or both the I and E zones. This means that all the descendants are in the range, and may be reported without any additional distance evaluations. Otherwise we use cr(c) as before. The same idea can be used with BKT as well.

In (Chávez and Navarro, 2005) the authors analytically show that the optimal performance for an HC type hierarchy (the data structure was not made explicit, however) is achieved for the minimum β (0 < β < 1) that satisfies

F(s + r)F(s)^β + (1 − F(s − r))(1 − F(s))^β ≤ 1,

where F(x) = Prob(d(a, b) ≤ x), r is the search radius and s is the cutting point of distances that go to the left subtree. The search then takes O(n^β) time. Note that F(x) can be seen as the cumulative histogram of distances, and this heavily depends on the dimension of the space, becoming more concentrated around the mean as the dimension grows. The only free parameter in the above equation is s, suggesting that for a high dimensional space and a large r one should use a small s, i.e. a small covering radius in HC.
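A compact sketch of the resulting HC range search follows (illustrative names; the report-whole-subtree shortcut is omitted for brevity):

    // HC range search sketch. Each node stores a center c, the covering radius
    // cr (maximum distance from c to objects in its I subtree) and cr2 = cr'(c)
    // (maximum distance from c to any descendant). Illustrative only.
    #include <vector>
    #include <string>
    #include <functional>

    struct HCNode {
        std::string center;
        double cr  = 0.0;   // covering radius of the I zone
        double cr2 = 0.0;   // cr'(c): radius covering all descendants
        HCNode* I = nullptr;
        HCNode* E = nullptr;
    };

    using Dist = std::function<double(const std::string&, const std::string&)>;

    void hc_range(const HCNode* node, const std::string& q, double r,
                  const Dist& d, std::vector<std::string>& out) {
        if (!node) return;
        double e = d(q, node->center);
        if (e <= r) out.push_back(node->center);
        if (e > node->cr2 + r) return;       // query ball misses all descendants
        if (e <= node->cr + r)               // query intersects the I zone
            hc_range(node->I, q, r, d, out);
        if (e > node->cr - r)                // query not fully inside the I zone
            hc_range(node->E, q, r, d, out);
    }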


3.2. Multiple hierarchies

As detailed above, the optimal unbalance depends on the search range. However, this is usually not known at the time the index is constructed. A way to trade space for time is then to build several instances of HC, each optimized for a specific query range. At search time a more unbalanced structure is chosen for larger search ranges. We show experimentally in Section 5 that only a few hierarchies suffice in practice.

3.3. Combination of AESA and BKT/HC

The problem with AESA is its high preprocessing and space complexities. For small dictionaries this is not a problem, so we propose using AESA as an additional search algorithm for the buckets of b objects stored into the leaves of the tree based indexes BKT and HC. This means that the space complexity becomes O(b^2 · n/b) = O(nb), and the construction takes O(b^2 n/b + n log(n/b)) = O(n(b + log(n/b))) distance evaluations. These are very reasonable for many applications. Our experimental results show that this simple idea improves BKT and HC considerably. This idea is of course very general, and can be applied to many other indexing algorithms, distance functions or objects just as well. The net effect is that the algorithms become memory adaptive: the more memory we can give to them, the fewer distance computations are needed at search time. We call the resulting algorithms ABKT and AHC.

3.4. Extended BKT

BKT has for each node a child for each distinct distance between the corresponding pivot p and the rest of the objects. The search then enters each child whose corresponding distance e satisfies d(q, p) − r ≤ e ≤ d(q, p) + r. A way to trade space for time in this case is to build the children for ranges of distances, and not just for single distances. That is, we label the children with the pairs (e, r) instead of just e. The child (e, r) of the new tree then contains the children e − r … e + r of BKT, collapsed into a single child. The benefit comes at search time: we enter only one child, (d(q, p), r). We call this algorithm EBKT. The problem with this approach is its superlinear space complexity, as the same object is entered in several subtrees. Hence this method is feasible for small databases only, and it requires that the (maximum) query range is known in advance. This rules out many applications, but e.g. on-line approximate word level string matching is well suited for this. The query range is known before the index is built, and the number of patterns is usually small or moderate. Also, r is usually a (relatively small) constant in this case, so the children can be constructed for only the one required range. For large dictionaries there are two options. EBKT can be used to replace AESA to implement the buckets for the leaves of BKT or HC; in fact, EBKT can be seen as an efficient implementation of an AESA like algorithm, which reduces the extra CPU time to O(1) per distance evaluation. The other possibility would be to use EBKT at the top level, but build only a few levels of the tree, and then switch back to BKT.
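A one-level sketch of the idea follows (illustrative names; it assumes integer distances and that the maximum query radius is fixed when the structure is built):

    // One-level EBKT sketch: children are buckets labeled by the distance to
    // the pivot, and each object is stored in all children within built_r of
    // its true distance. A query with radius <= built_r inspects one bucket.
    #include <map>
    #include <vector>
    #include <string>
    #include <functional>

    using Dist = std::function<int(const std::string&, const std::string&)>;

    struct EBKT1 {
        std::string pivot;
        int built_r;                                   // the range r' built for
        std::map<int, std::vector<std::string>> child; // child[e]: all u with
                                                       // |d(pivot,u) - e| <= built_r
    };

    EBKT1 build_ebkt1(const std::string& pivot, const std::vector<std::string>& objs,
                      int built_r, const Dist& d) {
        EBKT1 t{pivot, built_r, {}};
        for (const auto& u : objs) {
            int e = d(pivot, u);
            for (int label = e - built_r; label <= e + built_r; ++label)
                t.child[label].push_back(u);           // superlinear: 2r'+1 copies
        }
        return t;
    }

    void ebkt1_range(const EBKT1& t, const std::string& q, int r, const Dist& d,
                     std::vector<std::string>& out) {
        int e = d(q, t.pivot);
        if (e <= r) out.push_back(t.pivot);
        auto it = t.child.find(e);                     // enter only one child
        if (it == t.child.end()) return;
        for (const auto& u : it->second)
            if (d(q, u) <= r) out.push_back(u);        // verify the candidates
    }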


Finally, note that the structure can be built for some (single) range r′, which does not have to be the query range. If at search time the query radius r is less than r′, we can still enter only one child (the one labeled with d(q, p)). This branch is larger than necessary, as we know that some objects in that branch must be too far away, but it does not affect the correctness of the search algorithm. Likewise, in principle it is possible to use the structure in the case where r > r′, by entering the children d(q, p) − (r − r′) … d(q, p) + (r − r′). However, this makes the search only slower in practice as compared to plain BKT.

3.5. Extended HC

Similar methods as in EBKT can be used for other structures as well. We briefly outline the method for HC here. Recall the branching conditions of the search algorithm: if d(q, c) ≤ cr(c) + r, the query intersects the I zone, and hence the I branch is entered; if d(q, c) > cr(c) − r, the query is not fully contained in I, and the E branch is entered. Assume now that we know the query range (or its upper bound) at index construction time. We use two covering radii for each center, cr_L(c) (lower bound) and cr_U(c) (upper bound). These can be selected in any way preferred, but we require that |cr_L(c) − cr_U(c)| ≥ 2r, where r is the query range. The new hierarchy has four branches for each node, I_L, I_U, E_L and E_U, corresponding to the two covering radii, i.e. I_L and E_L are built using cr_L(c) and similarly I_U and E_U using cr_U(c). In other words, each object is stored in two subtrees, making the tree superlinear in size. The benefit comes at query time. Since each object is stored either in I_L ∪ E_L or in I_U ∪ E_U we can use either cr_L(c) or cr_U(c) to prune the search. Because of the condition |cr_L(c) − cr_U(c)| ≥ 2r the query range is totally contained in one of the four possible branches. We enter only this branch. We call this algorithm EHC.

4. Approximate searching and bit-parallel distance evaluation

In (Chávez and Navarro, 2003) a general probabilistic method was given to ''stretch the triangle inequality''. The normal triangle inequality allows us to discard any element u in the database that satisfies |d(u, p) − d(q, p)| > r, where q is the query object, and p is some pivot object. ''Stretching'' means that we now discard u if |d(u, p) − d(q, p)| > r/γ holds, for some γ ≥ 1. They show that this method can dramatically reduce the number of distance computations, while keeping the error probability low, i.e. only a small fraction of relevant results is missed; see Chávez and Navarro (2003) for details. We omit the details here for lack of space, but the fundamental consequence for our algorithms, for example for HC, is that we can prune the I branch if d(q, c) > cr(c) + r/γ, and the E branch if d(q, c) ≤ cr(c) − r/γ. The condition that we report the center c if d(q, c) ≤ r is unaffected. The same idea can be applied to the other algorithms as well. Note that the idea is better suited for continuous than for discrete distance functions with small range.
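As a small illustration, reusing the HCNode and Dist definitions from the HC sketch in Section 3.1, the only change for the approximate variant is to divide the query radius by γ in the pruning tests while keeping the reporting test intact (a sketch, not the exact implementation used in the experiments):

    // Approximate (stretched) HC branching: identical to hc_range except that
    // r is replaced by r/gamma in the pruning conditions only. gamma = 1
    // gives the exact search back. (HCNode and Dist as in Section 3.1.)
    void hc_range_approx(const HCNode* node, const std::string& q, double r,
                         double gamma, const Dist& d,
                         std::vector<std::string>& out) {
        if (!node) return;
        double e = d(q, node->center);
        if (e <= r) out.push_back(node->center);   // exact reporting condition
        double rs = r / gamma;                     // stretched radius
        if (e > node->cr2 + rs) return;
        if (e <= node->cr + rs) hc_range_approx(node->I, q, r, gamma, d, out);
        if (e > node->cr - rs)  hc_range_approx(node->E, q, r, gamma, d, out);
    }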

We now present a technique that evaluates several distances in parallel (in a sequential computer). While speeding up the search in general, the additional benefit is that the method works especially well with approximate searching. The inherent parallelism of the standard arithmetic and bit-wise operations of modern computers can be used to parallelize many simple computations. The basic idea is to pack several data objects into a single computer word so as to update all of them in parallel.

In (Hyyrö et al., 2004) a new on-line multiple approximate string matching algorithm was proposed. The algorithm is based on the bit-parallel algorithm of Myers (1999), and can search R patterns, each of length m, from a text of length n in O(n⌈R/⌊w/m⌋⌉) time, where w is the number of bits in a machine word. This algorithm is trivial to modify to compute the edit distance of a query string q against R other strings in parallel. The edit distance ed(a, b) is the number of insertions, deletions and substitutions that are needed to convert string a into string b. For strings of length m, we can compute R = ⌊w/m⌋ edit distances in time O(|q|), i.e. about 4–10 edit distances in parallel, assuming w = 32 or 64, and that the average string length is about 6–8. This raises possibilities for parallel (edit) distance evaluations for metric indexing algorithms.

Many recent processors support SIMD (Single Instruction-Multiple Data, or ''multimedia'') instructions that can be used to evaluate several distances in parallel. For example, Intel Pentium 3's and AMD Athlon's SSE instructions¹ can be easily used to evaluate four L1 (Manhattan), L2 (Euclidean) or L∞ (max) distances in parallel.² The necessary instructions are supported by many recent C/C++ compilers, and e.g. GNU gcc and Intel icc allow the use of standard arithmetic operations with the SIMD data types. If the range of the vector values is small and great accuracy is not required, then it is very easy to use the SSE2 extensions and fixed point arithmetic to evaluate e.g. eight distances in parallel using 16 bit fixed point numbers.

1 Other vector extensions are e.g. AMD's 3DNow!, DEC's MVI, Sun's VIS, Motorola's AltiVec, and MIPS's MDMX.
2 Note that speeding up single distance evaluations using the same instructions is also possible, but much less efficient, because the instruction set lacks support for ''horizontal'' instructions.
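As an illustration of the SIMD idea (a sketch, not the code used in the experiments; it assumes an SSE-capable x86 target and four database vectors stored coordinate-interleaved), four squared L2 distances can be accumulated in one pass over the coordinates; the squared results can then be compared against r^2 to avoid square roots:

    // Four squared L2 distances in parallel with SSE. The four database
    // vectors are stored interleaved: data[4*i + j] is coordinate i of
    // vector j. Illustrative sketch only.
    #include <xmmintrin.h>

    void l2sq_x4(const float* query, const float* data, int dim, float out[4]) {
        __m128 acc = _mm_setzero_ps();
        for (int i = 0; i < dim; ++i) {
            __m128 q = _mm_set1_ps(query[i]);        // broadcast query coordinate
            __m128 v = _mm_loadu_ps(data + 4 * i);   // coordinate i of 4 vectors
            __m128 diff = _mm_sub_ps(q, v);
            acc = _mm_add_ps(acc, _mm_mul_ps(diff, diff));
        }
        _mm_storeu_ps(out, acc);                     // out[j] = squared distance j
    }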


4.1. Parallel BKT

The simplest application of parallel edit distance computation to BKT is to store a bucket of R strings into each node, instead of only one (the pivot), and use one of them as the pivot string for building the hierarchy and guiding the search. In the preprocessing phase the effect is that the tree has only O(n/R) nodes (assuming b = 1). At search time, we evaluate the distance between the query string and the pivot as before, but at the same time, without any additional cost, we evaluate the R − 1 other distances. For these R − 1 other distances we just check if they are close enough to the query (this can be done in parallel in O(1) time), but do not use them for any other purpose; only one of the strings is used as a pivot and to guide the search. We call this algorithm PBKT. One can still use AESA to implement the leaves.

4.2. Parallel HC

HC can be used for vector spaces as well, since the distance function is not assumed to be discrete. The above simple method would work with HC as well, but in this case we will use the ''extra centers'' to increase the arity of the tree to R + 1. For each node in the hierarchy, we choose R centers, and build correspondingly R ''I zones''. The covering radii are chosen so that each zone includes h(n)/(R + 1) objects. The final n − h(n)R/(R + 1) objects go into the E branch. The zones are built sequentially, so that objects that go into the I_i branch are not considered when building I_{i+1}. All the I zones and the one E zone are built recursively, as in binary HC. To produce a balanced tree, set h(n) = n. The search algorithm evaluates the R distances e_i = d(q, c_i) in parallel. Again, if e_i ≤ r, the center c_i is reported. If e_i ≤ cr(c_i) + r, i.e. the query intersects the zone, the zone I_i is searched recursively. If e_i ≤ cr(c_i) − r, i.e. the query is fully contained in the zone I_i, we prune the zones I_{i+1} … I_R and E. We call this algorithm PHC. As with HC, a heavily unbalanced tree (UPHC) performs best in practice.
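A condensed sketch of the PHC node layout and search loop follows (illustrative; parallel_dist stands for a batched distance kernel such as the SSE example above, here assumed to return actual, not squared, distances):

    // PHC range search sketch: each node holds R centers whose distances to
    // the query are computed in one parallel batch. Illustrative only.
    #include <array>
    #include <vector>
    #include <functional>

    constexpr int R = 4;
    using Vec = std::vector<float>;

    struct PHCNode {
        std::array<Vec, R> centers;
        std::array<float, R> cr{};        // covering radius of each I zone
        std::array<PHCNode*, R> I{};
        PHCNode* E = nullptr;
    };

    using BatchDist =
        std::function<std::array<float, R>(const Vec&, const std::array<Vec, R>&)>;

    void phc_range(const PHCNode* node, const Vec& q, float r,
                   const BatchDist& parallel_dist, std::vector<Vec>& out) {
        if (!node) return;
        std::array<float, R> e = parallel_dist(q, node->centers);  // one batch
        bool contained = false;
        for (int i = 0; i < R && !contained; ++i) {
            if (e[i] <= r) out.push_back(node->centers[i]);
            if (e[i] <= node->cr[i] + r)          // query intersects zone I_i
                phc_range(node->I[i], q, r, parallel_dist, out);
            if (e[i] <= node->cr[i] - r)          // query fully inside zone I_i:
                contained = true;                 // prune I_{i+1}..I_R and E
        }
        if (!contained) phc_range(node->E, q, r, parallel_dist, out);
    }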

4.3. Rationale

Approximate searching works extremely well with parallel distance computations, since it affects only the branching of the search algorithm, and the unchanged condition d(q, c_i) ≤ r used to detect the matching objects applies to all the R centers/pivots. Hence there are much fewer possibilities to miss a relevant object. This is verified by the experimental results.

5. Experimental results

We have implemented all the algorithms in C/C++, and compiled them with icc 7.1. We ran the experiments on a 2 GHz Pentium 4 with 512 MB RAM, running GNU/Linux 2.4.18.

Implementation. For the bit-parallel edit-distance computation we used a fixed R = 8 and implemented the parallel edit distance computations using the 128 bit SSE2 extensions available in the Intel Pentium 4 and AMD64 architectures. The compiler supports standard arithmetic operations for the extensions. All the strings in the data structures are preprocessed for bit-parallel searching at construction time. For the parallel computation of the L2 metric for vector spaces we used R = 4, implemented with SSE instructions. We also implemented R = 8 with fixed-point arithmetic, using the SSE2 extensions.

Data sets. For the experiments on the metric space of strings with edit distance, we used a dictionary of 98,580 English words, and the query words were randomly selected from the dictionary.


Fig. 1. Results for 1000 random queries, and the full dictionary. Left: total query times. Right: average number of distance evaluations/query. Top row: comparison between different algorithms. Bottom row: UHC plots, h(n) = n^(1/a)/2.


We also ran some experiments with 100,000 strings of length 8 extracted from the Escherichia coli DNA sequence. For vector spaces with the L2 distance, we used randomly generated data in a k dimensional unitary hyper cube. This allows easy control of the dimension, while in real world data the representational dimension might be much higher than the real, intrinsic, dimensionality. The size of the data set was 100,000 vectors. Finally, we experimented with real document data. The data set was 15,973 Usenet News articles; 1000 articles were used for queries, and the rest were indexed. We used the cosine distance.

5.1. Pivot selection

It is well known that the pivot (or center) selection can have a large impact on the query performance. See Bustos et al. (2003) for a general method for choosing good pivots, not depending on the type of the objects indexed.

The simplest method is to select the pivots at random. More elaborate methods may need much more preprocessing effort, but this is usually amortized by faster queries. The most immediate improvement for the presented methods would be to improve the naïve random pivot selection. However, this is orthogonal to our other improvements. We used the following strategies in the experiments:

BKT and (P)HC. We take the object that minimizes the mean squared distance to the rest. Since the number of objects can be very large, random sampling is used to reduce the number of distance evaluations. This is done by randomly selecting P candidates and evaluating their mean squared distance to D randomly selected objects. Each pivot is therefore selected with at most O(PD) distance computations (less near the leaves, since the number of objects becomes less than P or D). The values P = 100 and D = 1000 produce good results.
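A small sketch of this sampled pivot selection (illustrative only; the object type, the constant seed and the helper names are placeholders):

    // Sampled pivot selection: among P random candidates, pick the one with
    // the smallest mean squared distance to D random reference objects.
    #include <vector>
    #include <string>
    #include <random>
    #include <functional>
    #include <limits>

    using Dist = std::function<double(const std::string&, const std::string&)>;

    std::string select_pivot(const std::vector<std::string>& objs, const Dist& d,
                             int P = 100, int D = 1000, unsigned seed = 42) {
        std::mt19937 rng(seed);
        std::uniform_int_distribution<size_t> pick(0, objs.size() - 1);
        std::string best;
        double best_score = std::numeric_limits<double>::max();
        for (int i = 0; i < P; ++i) {
            const std::string& cand = objs[pick(rng)];
            double score = 0.0;
            for (int j = 0; j < D; ++j) {        // O(PD) distance evaluations
                double e = d(cand, objs[pick(rng)]);
                score += e * e;                  // dividing by D would not
            }                                    // change the argmin
            if (score < best_score) { best_score = score; best = cand; }
        }
        return best;
    }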


Fig. 2. The effect of giving the algorithms more memory. Top: dictionary of 98,580 English words; middle and bottom: 100,000 E. coli strings of length 8. Left: total time in seconds for 1000 random queries. Right: the average number of distance evaluations per query. Top and middle rows use r = 1, the bottom row r = 2. The x-axis shows (approximately) the memory used by the algorithms.


PBKT. In the case of PBKT we have two kinds of pivots: the ''real'' pivot, and R − 1 ''free riders''. The real pivot is selected as in BKT, so as to have good discriminating power. The R − 1 free objects are taken as the R − 1 objects that are farthest away from the real pivot.

U(P)HC. We used random pivots, since this produced very good results. The construction is costly even with random selection.

5.2. Experiments on strings

Fig. 1 shows the total CPU times, and the average number of distance computations, for 1000 random queries and the different algorithms. For UHC we used h(n) = n^(1/3)/2. This produces extremely unbalanced hierarchies, but for large query radii it gives much better results than the balanced HC. Note that HC corresponds to the basic VPT.

We also give plots showing the effect of different choices of h(n). In general, UPHC and PBKT are the winners. PBKT has the advantage that the preprocessing is very efficient, while it is very slow for UPHC, depending on the degree of unbalance. Plugging AESA into HC and BKT reduces the number of distance evaluations considerably, see Fig. 2. In general, the times were reduced only for small bucket sizes, due to the higher CPU times of AESA. For UHC the benefit is negligible (results not shown). Using AESA with PHC or PBKT makes the algorithms only slower (results not shown). We also used E(P)BKT for the top 1…5 levels of the tree (and plain (P)BKT for the rest). This is clearly faster than ABKT, as the extra CPU time is negligible. However, the number of distance evaluations for EBKT is clearly larger than for ABKT, hence the method of choice depends on the complexity of the distance function. As shown, EBKT is useful only for small query ranges, contrary to ABKT and AHC.


Fig. 3. Results for 1000 random queries. Top: dimension 10. Middle and bottom: dimension 15. Left: total query times. Right: average number of distance evaluations/query. The number after AHC denotes the bucket size.


5.3. Experiments on vectors

Fig. 3 shows the times and the numbers of distance evaluations for the different algorithms. BKT works only for discrete distance functions, so it was omitted. We also experimented with fixed point arithmetic versions of PHC, denoted by FPHC. The results are similar to the edit distance results; parallel computation makes the range queries much faster. We also compared AHC with different bucket sizes against the other algorithms. AHC becomes competitive against (U)PHC with respect to the number of distance evaluations only for quite large bucket sizes. However, large bucket sizes mean large CPU costs, which shows up since the distance computations are relatively cheap.

Fig. 4 shows the performance of HC for differently (un)balanced hierarchies. We show only the number of distance evaluations; the actual times were similar. We used two different balancing schemes: (i) h(n) = n^(1/a) and (ii) h(n) = n/a. The latter was suggested in (Chávez and Navarro, 2005). For small query ranges unbalancing helps significantly only for dimensions larger than about 10. For large dimensions the benefits show already for quite small query ranges. The results of method (i) converge for large query ranges, while method (ii) is more stable especially for small ranges. In either case, the results suggest that only a few trees need to be preprocessed and stored to get good overall performance.

As unbalancing does not work well for low dimensions and small query ranges, we can try to improve the search by using AHC or EHC. Fig. 5 shows the results for a six dimensional vector space. For EHC we built the structure for 0…6 levels of the tree, and then switched back to plain HC. For AHC the maximum bucket size was 1000. As shown, AHC uses the memory better to reduce the number of distance evaluations, and is also faster for small bucket sizes than EHC. However, EHC eventually becomes faster when even more memory is given to the two algorithms. Again, the large CPU cost of AESA shows up. We also experimented with high dimensional spaces, but EHC gave negligible improvements, so we omit the plots.

Finally, Fig. 6 shows the reduced times and numbers of distance evaluations, as well as the percentage of the relevant objects (i.e. not missed) retrieved, for approximate search using γ = 1.0…3.0. PHC tolerates the largest γ values, while the unbalanced algorithms are the worst. However, UPHC is still the method of choice, since it is beaten in time efficiency or in the number of distance computations only when γ comes close to 3, but in that case all the algorithms will miss 20–40% of the relevant objects. PHC retrieves 96.51% of the relevant results even for γ = 2, and UPHC 97.74% for γ = 1.6. Comparing UPHC with γ = 1.6 against UPHC, UHC, and HC (VPT), all with γ = 1.0, gives 3.1, 7.3 and 15.1 fold speed-ups, respectively.


Fig. 4. The effect of the unbalance for HC. We used h(n) = n^(1/a)/2 (left plots) and h(n) = n/a (right plots) to compute the subtree sizes. The curves are for different a. The top row shows the number of distance evaluations for different dimensions. In each case the query retrieved 0.01% of the database. The bottom row shows the average number of distance evaluations for 15 dimensional vectors and 1000 random queries for different query ranges.


Fig. 5. Comparing EHC and AHC in six dimensional vector space. The query retrieves 0.01% of the database. Left: the total time for 10,000 queries; right: the average number of distance evaluations. The x-axis shows the memory used.


Fig. 6. The effect of approximate search for r = 0.7 and different γ values, for a vector space of dimension 15.

5.4. Experiments on document retrieval

Finally, we experimented with real document data. The data set was 15,973 Usenet News articles; 1000 articles were used for queries, and the rest were indexed. The documents are treated as vectors in a space where the terms (distinct words) are used as the coordinates. The coordinate value of document i and term j is defined as (1 + log(f_ij))(1 + log(n/n_j)), where f_ij is the number of occurrences of term j in document i, n_j is the total number of documents where term j occurs, and n is the total number of documents.
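A small sketch of this weighting and of the cosine distance used below (illustrative only; dense vectors are used for simplicity):

    // TF-IDF-style weight and cosine distance sketch (illustrative).
    #include <cmath>
    #include <vector>

    // Weight of term j in document i: (1 + log f_ij) * (1 + log(n / n_j)).
    double weight(double f_ij, double n_j, double n) {
        return (1.0 + std::log(f_ij)) * (1.0 + std::log(n / n_j));
    }

    // Cosine distance = angle between two vectors of equal dimension.
    double cosine_distance(const std::vector<double>& a,
                           const std::vector<double>& b) {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (size_t i = 0; i < a.size(); ++i) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        double c = dot / (std::sqrt(na) * std::sqrt(nb));
        if (c > 1.0) c = 1.0;
        if (c < -1.0) c = -1.0;       // guard against rounding before acos
        return std::acos(c);
    }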

Many other weighting schemes are possible, see e.g. Witten et al. (1999). The resulting very high dimensional space was then mapped into a vector space of dimension 50, using the random projection method proposed in (Bingham and Mannila, 2001). This process can be seen as Latent Semantic Analysis (Indexing), see Bingham and Mannila (2001) and Baeza-Yates and Ribeiro-Neto (1999). The document vocabulary was also preprocessed with stemming and removal of stop words, but the details are out of the scope of the present paper. The distance function was the widely used cosine distance, i.e. the angle between the two document vectors.

Fig. 7 shows the results for the above document collection, for various query ranges. Using approximate search (γ = 2.0) with UPHC retrieves >99.6% of the relevant answers for all tested ranges.


Fig. 7. Results for a collection of 14,973 indexed documents and 1000 queries, showing the average number of distance evaluations/query. For UPHC (apprx) we used γ = 2.0. The range (r) is given in degrees of the angle. The right plot shows the performance of AHC with different bucket sizes.

AHC performs especially well in this application, but requires very large bucket sizes. Even then, it is competitive against approximate UPHC only for small query ranges.

6. Conclusions

We have considered indexing in metric spaces, and proposed several simple improvements over existing algorithms. These are experimentally shown to perform well in practice. Most of the ideas are general, and can be applied to many other indexing methods as well. We have considered only the range search, while in some applications nearest neighbor (NN) searches are more interesting. However, as shown in (Chávez et al., 2001), NN searches can be systematically built over range search algorithms for any of the algorithms we have considered.

Acknowledgements

The comments from the anonymous reviewers are highly appreciated. Supported by the Academy of Finland, grant 207022.

References

Baeza-Yates, R., Ribeiro-Neto, B., 1999. Modern Information Retrieval. ACM Addison-Wesley.
Bingham, E., Mannila, H., 2001. Random projection in dimensionality reduction: Applications to image and text data. In: Proc. KDD'01. ACM Press, pp. 245–250.
Burkhard, W.A., Keller, R.M., 1973. Some approaches to best-match file searching. Comm. ACM 16 (4), 230–236.

Bustos, B., Navarro, G., Chávez, E., 2003. Pivot selection techniques for proximity searching in metric spaces. Pattern Recognition Lett. 24 (14), 2357–2366.
Chávez, E., Navarro, G., 2000. An effective clustering algorithm to index high dimensional metric spaces. In: Proc. 7th Internat. Symposium on String Processing and Information Retrieval (SPIRE'2000). IEEE CS Press, pp. 75–86.
Chávez, E., Navarro, G., 2003. Probabilistic proximity search: Fighting the curse of dimensionality in metric spaces. Inf. Process. Lett. 85, 39–46.
Chávez, E., Navarro, G., 2005. A compact space decomposition for effective metric indexing. Pattern Recognition Lett. 26 (9), 1363–1376.
Chávez, E., Navarro, G., Baeza-Yates, R., Marroquin, J.L., 2001. Searching in metric spaces. ACM Comput. Surveys 33 (3), 273–321.
Fredriksson, K., 2004. Metric indexes for approximate string matching in a dictionary. In: Proc. 11th Internat. Symposium on String Processing and Information Retrieval (SPIRE'2004), LNCS, vol. 3246. Springer-Verlag, pp. 212–213.
Hjaltason, G.R., Samet, H., 2003. Index-driven similarity search in metric spaces. ACM Trans. Database Systems (TODS) 28 (December), 517–580.
Hyyrö, H., Fredriksson, K., Navarro, G., 2004. Increased bit-parallelism for approximate string matching. In: Proc. 3rd Workshop on Efficient and Experimental Algorithms (WEA'04), LNCS, vol. 3059, pp. 285–298.
Myers, G., 1999. A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. Assoc. Comput. Mach. 46 (3), 395–415.
Uhlmann, J., 1991. Satisfying general proximity/similarity queries with metric trees. Inf. Process. Lett., 175–179.
Vidal, E., 1986. An algorithm for finding nearest neighbors in (approximately) constant average time. Pattern Recognition Lett. 4, 145–157.
Witten, I.H., Moffat, A., Bell, T.C., 1999. Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd ed. Morgan Kaufmann Publishers Inc.
Yianilos, P., 1993. Data structures and algorithms for nearest neighbor search in general metric spaces. In: Proc. 4th ACM-SIAM Symposium on Discrete Algorithms (SODA'93). SIAM Press, pp. 311–321.