Journal of Algorithms 41, 417–428 (2001), doi:10.1006/jagm.2001.1193, available online at http://www.idealibrary.com
Toward Optimal ε-Approximate Nearest Neighbor Algorithms

Matthew Cary
Department of Computer Science and Engineering, University of Washington, Seattle, Washington 98195
E-mail: [email protected]

Received May 5, 2000
We combine the recent optimal predecessor algorithm with a recent randomized stratified tree algorithm for ε-approximate nearest neighbor to give an algorithm for ε-approximate nearest neighbor in a fixed-dimensional space that is optimal with respect to universe size. We also give a deterministic version of the stratified tree algorithm. © 2001 Elsevier Science

Key Words: data structures; searching; point location; nearest neighbor problem; approximation algorithms.
1. INTRODUCTION

The nearest neighbor (NN) problem on a set of n points P is to build a data structure that, when given a query point q, finds p ∈ P such that d(p, q) ≤ d(p′, q) for all p′ ∈ P. In low dimensions (2 or 3), this is considered a solved problem, with techniques such as Voronoi diagrams providing practical, log-height tree structures. Finding algorithms that work in arbitrary dimension has been more elusive; algorithms that work well for the plane have running times exponential in the dimension. This phenomenon is known in the folklore as the curse of dimensionality. Lower bounds have recently appeared that only somewhat support this curse [5, 6]. The gap between these lower bounds and the previous upper bounds is still exponential.

The ε-approximate nearest neighbor (ε-ANN, or just ANN) problem is as NN, but the returned point p need only satisfy d(p, q) ≤ (1 + ε) d(p′, q) for all p′ ∈ P. Techniques from exact nearest neighbor have produced practical algorithms for low dimensions, with search times that are exponential in the dimension, such as the balanced box decomposition of Arya et al. [3]. In the last 3 years a variety of techniques have produced algorithms for
this problem that have search time polynomial in dimension and polylogarithmic in the number of points [8–10]. Even more recently, Amir et al. [1] adapted one-dimensional stratified tree techniques for ANN, achieving an O(d + log log N + log(1/ε)) search time, where N is the universe size. This is only a factor of O(log log log N) away from the optimal universe-relative bounds [4]¹; however, the algorithm requires space at least (1/ε)^d. We match the search time of the lower bound here; however, our algorithm also requires space at least (1/ε)^d, which is more than allowed by the lower bound.

In Section 3 we present a deterministic stratified tree that is a variation of Amir et al. [1]. The general method used is applied in Section 4 to the optimal predecessor algorithm [4] to produce a constant-dimension, universe-relative optimal nearest neighbor algorithm.

1.1. Organization

In Section 2 we give some definitions and a brief overview of a related algorithm. In Section 3 we give a deterministic variant of the Amir et al. [1] algorithm. In Section 4 we give an improved algorithm based on the optimal predecessor algorithm.
2. DEFINITIONS AND STRATIFIED TREES

This section introduces notation and definitions used throughout the paper. Several spaces are used. The d-dimensional space of real numbers is denoted R^d. For x, y ∈ R^d, the distance between x and y is denoted d(x, y); l₂ (Euclidean) distance is assumed unless otherwise specified. A d-dimensional universe of size N is the d-dimensional cross product

[1, N] × · · · × [1, N],

also denoted [1, N]^d or [N]^d. B(p, r) denotes the ball of radius r centered at p: B(p, r) = {x : d(x, p) ≤ r}.
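For concreteness, these definitions translate directly into code. The following is a minimal Python sketch (the function names are ours, not the paper's) of the metric, ball membership, and the ε-ANN guarantee:

```python
import math

def dist(x, y):
    """l2 (Euclidean) distance between two points of R^d."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def in_ball(x, p, r):
    """Membership in B(p, r) = {x : d(x, p) <= r}."""
    return dist(x, p) <= r

def is_eps_ann(p, q, points, eps):
    """The epsilon-ANN guarantee: d(p, q) <= (1 + eps) * d(p', q)
    for every p' in the point set (equivalently, for the true NN)."""
    return dist(p, q) <= (1 + eps) * min(dist(pp, q) for pp in points)
```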
¹ The lower bound of Beame and Fich applies directly to exact nearest neighbor; a reduction by Indyk (private communication) shows that the lower bound applies to one-dimensional approximate nearest neighbor as well, with space increasing by a factor of O(1/ε²) and an additive factor of O(1/ε) to the time.
2.1. Stratified Trees

Amir et al. [1] present what is to our knowledge the first application of stratified trees (van Emde Boas trees) to multidimensional problems. Stratified trees [11] are used for sub-logarithmic search structures. The key idea is to use a tree with a large branching factor, proportional to the number of objects stored in the tree; this idea was extended in later search data structures [2, 7]. Amir et al. demonstrate a two-part randomized algorithm that performs ε-approximate nearest neighbor queries in expected O(d + log log N + log(1/ε)) time. The first part is a direct application of stratified trees to multiple dimensions. The second uses the stratified tree to provide a d²-approximation of the distance to the nearest neighbor, which is used to efficiently reduce the problem to an ε-approximate point location in equal balls.
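Since everything that follows builds on stratified trees, a one-dimensional refresher may help. The sketch below is the classic van Emde Boas recursion of [11] (insert and successor in O(log log N) time, each level halving the number of key bits), not the multidimensional variant of Amir et al.; hashing the clusters keeps the space near-linear:

```python
class VEB:
    """Classic van Emde Boas tree over {0, ..., u-1}, u a power of two.
    Each recursion halves the number of key bits, so insert and
    successor take O(log log u) time; clusters live in a hash table."""

    def __init__(self, u):
        self.u = u
        self.min = self.max = None
        if u > 2:
            self.half = 1                        # bits per low half
            while (1 << (2 * self.half)) < u:
                self.half += 1
            self.lo_size = 1 << self.half        # universe of each cluster
            self.clusters = {}                   # nonempty clusters only
            self.summary = VEB(u >> self.half)   # which clusters are nonempty

    def insert(self, x):
        if self.min is None:
            self.min = self.max = x
            return
        if x < self.min:
            x, self.min = self.min, x            # new min; push old min down
        self.max = max(self.max, x)
        if self.u > 2:
            hi, lo = x >> self.half, x & (self.lo_size - 1)
            if hi not in self.clusters:
                self.clusters[hi] = VEB(self.lo_size)
                self.summary.insert(hi)
            self.clusters[hi].insert(lo)

    def successor(self, x):
        """Smallest stored key strictly greater than x, or None."""
        if self.min is not None and x < self.min:
            return self.min
        if self.u <= 2:
            return self.max if self.max is not None and self.max > x else None
        hi, lo = x >> self.half, x & (self.lo_size - 1)
        c = self.clusters.get(hi)
        if c is not None and c.max is not None and lo < c.max:
            return (hi << self.half) | c.successor(lo)
        nxt = self.summary.successor(hi)         # next nonempty cluster
        if nxt is None:
            return None
        return (nxt << self.half) | self.clusters[nxt].min

# t = VEB(1 << 16); t.insert(3); t.insert(40000); t.successor(10) -> 40000
```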
3. A TWO-DIMENSIONAL DETERMINISTIC STRATIFIED TREE

Here we present a deterministic algorithm for solving the ε-approximate nearest neighbor problem in a two-dimensional universe of size N, for ε < 1, in time O(log log N) and space polynomial in |P| and 1/ε. This data structure is inspired by Amir et al. We present a two-dimensional version of the tree. Using the methods of Amir et al. [1], it can be generalized to arbitrary dimension d to run in time O(d + log log N).

Our search works by subdividing the universe into overlapping balls whose radius depends on 1/ε. The intuition is that an error of r in the search result may be tolerated if the query is at least distance r/ε from any other point. This technique is described in an extended version of Amir et al. [1], although we developed it independently for Section 4.

3.1. The Algorithm

A tree is constructed from the initial universe N and set P of points. Each node of the tree is described by a point set P′ in a universe [N′]² for N′ ≤ N. If N′ ≤ 1/ε², store in O(1/ε⁴) space a table of the nearest neighbor to each point in the universe. If N′ > 1/ε², split [N′]² into N′ squares S_ij, each of size √N′ × √N′. Define the center of S_ij to be p_ij. Define N′ balls B_ij by B_ij = B(p_ij, √N′/ε). Create a hash table for the pairs (i, j) such that P′ ∩ B_ij ≠ ∅. In each entry for a nonempty B_ij store the recursive data structure for the subproblem P′ ∩ B_ij in a universe of size √N′/ε. Also store a recursive data structure H for the nonempty S_ij considered as points in a universe of side √N′/ε.
FIG. 1. Deterministic search algorithm.
This H is analogous to the H′ of the algorithm of Amir et al. (Section 2), except that the universe is larger by a factor of 1/ε.

The search for the nearest neighbor of a query point q in a universe of size N′ is analogous to that of a one-dimensional stratified tree and is given in Fig. 1. To prove correctness, note that there are two places where error is introduced. First, when recursing in a B_ij, there may be a point outside of B_ij that is closer to q than any point inside. Second, when recursing on H there may be a point closer to q than the point actually chosen from S_hk. In either case an error of O(√N′) is introduced; however, the distance from q to this nearest neighbor is √N′/ε, because it lies outside of B_ij. By the discussion in the previous section this produces a correct approximate nearest neighbor.

To analyze the running time we consider how much the universe is reduced at each recursion. If the initial universe is N = 2^k, after i reductions the universe has been reduced to 2^{k/2^i + log(1/ε) Σ_{j=0}^{i−1} 2^{−j}} ≤ 2^{k/2^i + 2 log(1/ε)}. After log log N reductions the universe has been reduced to 2/ε² or smaller, where the base case of the search algorithm finds the nearest neighbor in constant time. Hence the running time is O(log log N).

Because of the overlapping B_ij's, each point in the original point set P may appear in O(1/ε²) leaves. As noted above, the storage for each leaf is O(1/ε⁴), giving a total storage for the leaves of O(|P|/ε⁶). Each internal node in the tree of recursive data structures is a hash table with at most |P| entries; there are no more internal nodes than there are leaves, so the total storage for this data structure is O(|P|²/ε⁶). Note that this analysis is very rough; its purpose is only to convince the reader that the data structure requires space polynomial in |P| and 1/ε.

If ε > 1/√2, then set B_ij = {x : d(x, S_ij) ≤ √N′/ε}. Leaf nodes occur when N′ is less than some constant, rather than 1/ε². With these changes, the algorithm described can be used for the ε-approximate point location in equal balls search over arbitrary dimension mentioned in Section 2.
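A simplified Python sketch of this construction and search follows. The cell indexing, the loose base-case constant (N ≤ 8/ε² rather than 1/ε², which guarantees the recursive universe 2√N/ε bottoms out), and the linear scan standing in for the auxiliary structure H are our simplifications; Python dicts stand in for the hash tables:

```python
import math

def build(points, N, eps):
    """One node of the tree over the universe [0, N)^2.  Coordinates are
    kept unscaled for readability instead of being mapped into the
    child universe; constants are loose."""
    if N <= 8 / eps ** 2 or len(points) <= 1:
        return ("leaf", points)                  # the paper stores a full NN table
    side = math.sqrt(N)                          # squares S_ij of size side x side
    radius = side / eps                          # balls B_ij = B(p_ij, side/eps)
    candidates = set()                           # cells whose ball could be nonempty
    for (x, y) in points:
        for i in range(int((x - radius) // side), int((x + radius) // side) + 1):
            for j in range(int((y - radius) // side), int((y + radius) // side) + 1):
                candidates.add((i, j))
    children = {}                                # hash table over nonempty B_ij
    for (i, j) in candidates:
        center = ((i + 0.5) * side, (j + 0.5) * side)
        inside = [p for p in points if math.dist(p, center) <= radius]
        if inside:
            children[(i, j)] = build(inside, 2 * radius, eps)
    return ("node", side, children)

def query(node, q):
    if node[0] == "leaf":
        return min(node[1], key=lambda p: math.dist(p, q)) if node[1] else None
    _, side, children = node
    ij = (int(q[0] // side), int(q[1] // side))
    if ij not in children:                       # empty ball: the paper recurses on
        ij = min(children,                       # H; a scan of centers stands in here
                 key=lambda c: math.dist(((c[0] + 0.5) * side,
                                          (c[1] + 0.5) * side), q))
    return query(children[ij], q)
```

Under these assumptions, query(build(P, N, eps), q) returns a point whose distance to q is within a 1 + O(ε) factor of the true nearest neighbor's.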
4. AN IMPROVED ε-APPROXIMATE NEAREST NEIGHBOR ALGORITHM

This algorithm is a multidimensional version of the optimal predecessor algorithm presented by Beame and Fich [4]. It combines the ideas of that paper with those of Section 3. The key contribution of Section 3 is that it allows multidimensional point sets to be reasoned about like one-dimensional point sets; the complicated, three-part search of Amir et al. [1] is not necessary. For this section we assume ε < 1/√2 (if ε ≥ 1/√2, the same redefinition of neighborhood as used in the previous section can be used). We first present the algorithm for two dimensions and then describe in Section 4.5 how it can be extended to arbitrary dimension.

The algorithm organizes subsets P′ of the point set into trees of large branching factor. A tree of data points will be thought of in two ways: first, as the tree defined by the binary representation of the coordinates of points in P′; second, as a set of nested cells. Each cell is subdivided into a number of sub-cells equal to the branching factor. As in the previous section, each cell of width l contains the points of P′ lying within the l/ε ball containing the cell; this ball is referred to as the neighborhood of the cell.

The data structure consists of a forest of such trees. Each tree is associated with a subproblem. A tree associated with a subproblem of universe size N′ = 2^{k′} and point-set size s has a branching factor of 2^{k′/u}, where u = O(log log N / log log log N). A node in a tree containing s elements is heavy if its neighborhood contains at least s/n^{1/u} elements. Any ancestor of a heavy node is heavy. There are at most O((1/ε²) · n^{1/u}) heavy nodes per level in the tree; note the factor of 1/ε² difference between here and the optimal predecessor algorithm [4]. If there were no overlap between neighborhoods of cells, then there would clearly be at most n^{1/u} heavy nodes per level. But the neighborhoods overlap, increasing the possible number of heavy nodes by a factor proportional to the volume of a ball of radius 1/ε.

4.1. Hashing for Heavy Nodes

This section will prove that one can quickly locate heavy nodes within a tree. In this section we consider arbitrary dimensions, to prove a more general lemma that will be useful in the sequel. Under the assumption of an unrealistically large word size, using an extension of the parallel hashing algorithm of [4], a heavy node can be located in constant time. The trees hold points from a d-dimensional universe, for arbitrary d. If the word size is b and a₁, ..., a_q are a set of symbols, each of which is represented in at most b/q bits, let ⟨a₁ · · · a_q⟩ denote a₁, ..., a_q packed into a single word.
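Read concretely, the packing notation is just fixed-width concatenation, as in this small sketch (the paper performs such operations with constant-time word arithmetic; the loop is for exposition only):

```python
def pack(symbols, b):
    """<a_1 ... a_q>: q symbols, each at most b//q bits wide, concatenated
    into one b-bit word with a_1 in the most significant position."""
    width = b // len(symbols)
    word = 0
    for a in symbols:
        assert 0 <= a < (1 << width), "symbol wider than b/q bits"
        word = (word << width) | a
    return word

# pack([0b01, 0b10, 0b11], 6) == 0b011011
```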
The usual representation for d-dimensional points is by coordinate. A point x is represented as ⟨x₁ · · · x_d⟩, where x_i is the coordinate of x in the ith dimension. If the universe size is 2^k, each coordinate is a k-bit integer. The jth bit of the ith dimension is denoted x_i^j, where x_i^1 is the most significant bit of x_i, and x_i^k is the least significant bit. In this notation, if the word size is at least kd bits, the coordinate form of x is represented as ⟨x_1^1 x_1^2 · · · x_1^k | · · · | x_d^1 x_d^2 · · · x_d^k⟩. Instead of this representation, we will represent a point in interleaved form, where the most significant bits of all dimensions are grouped together, followed by the next most significant bits, and so on. Thus x in interleaved form is ⟨x_1^1 x_2^1 · · · x_d^1 | · · · | x_1^k x_2^k · · · x_d^k⟩. Using a multiply and a constant mask, a query can be transformed into interleaved form in O(d) time, if the query fits into a constant number of words.

We restate Lemma 17 from [4].

Lemma 4.1 [4, Lemma 17]. Let S₁, ..., S_u ⊆ [0, 2^t − 1] be sets of size at most 2^r, where r ≤ t. If memory words contain b ∈ Ω(tu²) bits, then there is an O(tu2^{(r+1)u})-bit data structure that, given ⟨z₁ · · · z_u⟩, in constant time returns a word ⟨x₁ · · · x_u⟩ such that z_i = x_i iff z_i ∈ S_i.

Using this lemma, we now describe how to find the deepest heavy ancestor of a query in constant time. This lemma applies to a subproblem of the original problem. The original query is in a universe [N]^d with n points; the subproblem is over a universe [N′]^d with s points. The points are stored in a tree T as described above.

Lemma 4.2. If memory words contain u³ d log N′ bits, then, for a tree T as above with branching factor 2^{k′/u} and depth at most u, there is a data structure that, given a point q ∈ [N′]^d represented in interleaved form, can determine the deepest ancestor of q that is a heavy node in T in constant time. The size of the data structure in words is polynomial in n.

Proof. Let x be a point in [N′]^d. Let the interleaved representation of x be ⟨x_1^1 · · · x_d^1 | · · · | x_1^{k′} · · · x_d^{k′}⟩. Let φ_i(x) be the interleaved prefix ⟨x_1^1 · · · x_d^1 | · · · | x_1^{ik′/u} · · · x_d^{ik′/u}⟩; that is, φ_i(x) specifies the ancestor of x at level i in the tree. Set

S_i = {φ_i(v) : v is a heavy node at level i in T}.

Then each point of S_i is represented using t = ud log N′ bits. Note that as there are O((1/ε)^d · n^{1/u}) heavy nodes per level, there is an r,

r ∈ O((log n + d log(1/ε))/u),
such that |S_i| ≤ 2^r for all i. In fact, we know r ≤ t, as S_i is not a multiset.

Let q ∈ [N′]^d, in interleaved representation. By Lemma 4.1, there is a data structure that, given ⟨φ₁(q) · · · φ_u(q)⟩, returns a word ⟨x₁ · · · x_u⟩ such that φ_i(q) = x_i iff φ_i(q) ∈ S_i. In other words, φ_i(q) = x_i iff q has a heavy ancestor at level i. Thus if m is the deepest such level, read off from the prefix OR of the exclusive OR of the two words, then the deepest ancestor of q that is a heavy node is found at level m and is identified by φ_m(q). From our bound on t, Lemma 4.1 requires a word with u²t = u³ d log N′ bits, which is satisfied by assumption. The size of the data structure follows immediately from the bound on r and the size bound in Lemma 4.1; note that this size is in terms of n, the number of points in the original problem, and not s, the number of points in the subproblem represented by T.
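The interleaved representation and the prefix-OR step are easy to make concrete. In the Python sketch below the constant-time multiply-and-mask transformation is replaced by an explicit loop, and deepest_heavy_level reads the level m off the XOR of the query word and the word returned by the Lemma 4.1 structure:

```python
def interleave(coords, k):
    """Interleaved form of a d-dimensional point with k-bit coordinates:
    the most significant bits of all dimensions first, i.e.
    <x_1^1 ... x_d^1 | ... | x_1^k ... x_d^k>.  (The paper does this with
    a multiplication and a constant mask; a loop is used here.)"""
    word = 0
    for j in range(k):                       # bit positions, MSB first
        for x in coords:
            word = (word << 1) | ((x >> (k - 1 - j)) & 1)
    return word

def deepest_heavy_level(q_word, ret_word, u, block_bits):
    """q_word packs phi_1(q) ... phi_u(q); ret_word is the word returned
    by the Lemma 4.1 structure, which agrees with q_word exactly on the
    blocks with phi_i(q) in S_i.  Ancestors of heavy nodes are heavy, so
    the agreeing blocks form a prefix; m is the number of leading
    all-zero blocks of the XOR (the 'prefix OR' computation)."""
    diff = q_word ^ ret_word
    total = u * block_bits
    m = 0
    while m < u and diff >> (total - (m + 1) * block_bits) == 0:
        m += 1
    return m                                 # deepest heavy ancestor: phi_m(q)

# interleave((0b10, 0b01), 2) == 0b1001
```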
4.2. Maneuvering in the Tree

It is instructive at this point to pause and compare this data structure with the stratified tree of the previous section. The branching factor in this tree is much less than that of stratified trees: an exponent of 1/u versus 1/2. This means a node in this data structure has fewer children than in a stratified tree; hence the reduction of the universe due to considering a node's children is much greater than in the stratified tree. The drawback is that the universe of the subproblem defined by a particular child is much too big: N′^{1−1/u} rather than N′^{1/2}. This could be offset if it were possible to traverse multiple levels of the tree in constant time.

Lemma 4.2 provides the tool to do this. Using the hashing there described, multiple levels of the tree can be traversed in one step to find the lowest heavy node v on a path to q. Once v has been found, there are two possibilities. Either q lies below a child of v, or q's nearest neighbor can be approximated by finding a nearest neighbor to q from the children of v. In the first case, the point set has been reduced by a factor of n^{1/u}; in the second, the universe size has been reduced to 2^{k′/u} (see Fig. 2).

To make this concrete, for a subproblem with universe 2^{k′} and set size s, let k′ = u^c and s = n^{a/u}. In case 1 of Fig. 2, s is reduced to n^{(a−1)/u}. In case 2, k′ is reduced to u^{c−1}. Thus O(a + c) parallel hashes are required to reduce this subproblem to either a constant-sized universe or a constant-sized point set. Let the size of the original universe be N = 2^k. The original size of the point set is n = n₁ = n^{u/u}, so that if k = u^u the number of parallel hashes done is O(u). As k = log N, this is satisfied by choosing u = O(log log N / log log log N).
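This accounting is easy to check numerically. The sketch below (our code, under the stated parameterization) picks the smallest u with u^u ≥ log N and counts the parallel hashes along whatever sequence of cases a query happens to take:

```python
import math

def choose_u(N):
    """Smallest u with u**u >= log2 N, so k = log N fits u^u;
    u is O(log log N / log log log N)."""
    k, u = math.log2(N), 2
    while u ** u < k:
        u += 1
    return u

def parallel_hashes(a, c, cases):
    """Count hashes along a query path.  `cases` is the branch taken at
    each step: case 1 shrinks the point set (a -= 1), case 2 shrinks
    the universe (c -= 1).  Either way at most a + c hashes occur."""
    steps = 0
    for case in cases:
        if a == 0 or c == 0:                 # constant point set or universe
            break
        if case == 1:
            a -= 1
        else:
            c -= 1
        steps += 1
    return steps

# For N = 2**64, choose_u(N) == 4 (since 4**4 = 256 >= 64), and a query
# starting from a = c = u finishes within 2u = 8 parallel hashes.
```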
FIG. 2. The two cases of the optimal algorithm. In case 1, q descends into a child containing at most s/n^{1/u} points; in case 2, the node has 2^{k′/u} children in total.
4.3. Detailed Algorithm Description

This lemma and its proof parallel [4, Lemma 2].

Lemma 4.3. Given memory words that satisfy Lemma 4.2, u ≥ 2, u^u ≤ n, s ≤ n^{a/u}, k′ = u^c, and ε = 1/2^τ, there is a data structure that supports ε-approximate nearest neighbor queries on an interleaved representation of a set S of s points from the universe [2^{k′}]² in O(c + a) time and space

O((cs/ε²) · (size of the Lemma 4.2 structure)).

We now describe the algorithm that proves the lemma. The data structure is like that of [4, Lemma 2]. Consider the tree with branching factor 2^{k′/u} and depth u representing the set S. Take a heavy node v with side length l in universe N′. Its children are drawn from the sub-cells with side length l′ = l/2^{k′/2u}. The sub-cells of this size contained in the l/ε neighborhood of v can be seen as an N″-universe, N″ = 2^{k′/2u}/ε; each sub-cell is one point in the universe. Define S′ to be the set of points (x, y) in [N″]² such that the cell with side length l′ centered at (l′ · x, l′ · y) contains a point of S (note we are not considering the 1/ε neighborhood of the cell). Let S_v^1 be the problem defined by S′ in the universe N″. So S_v^1 is a subproblem of size at most s in a universe with side length O(2^{k′/2u}/ε). For each point in S′ store a representative point from the corresponding cell in N′; this will be the point returned by a query on S_v^1.

Take a heavy node v with side length l, and a non-heavy child w of v, with side length l′ = l/2^{k′/2u}. Let S_w^2 = S ∩ B(w, l′/ε). As w is non-heavy,
FIG. 3. Optimal tree grid structure. The large, outlined cell is v. The children of v are S_w^2 over all of the small outlined cells w. As noted, a cell is a child if there are any points in its ε-neighborhood; the actual cell may or may not be empty. S_v^1 is formed from the universe N″ of small cells. A cell is in S′ if it contains a point of S; note how the ε-neighborhood is not considered in this case.
S_w^2 is a subproblem of size not more than s/n^{1/u} ≤ n^{(a−1)/u} in a universe of size 2^{k′} (see Fig. 3).

The data structure consists of

• the data structure from Lemma 4.2;
• for each heavy node v, a hash table containing the non-heavy children of v and, if the universe of the subproblem S_v^1 has side greater than 2^{3+2τ}, the recursive data structure for S_v^1 (otherwise store the nearest neighbors for the central cell of side 1/ε in a hash table of size 1/ε²);
• for each non-heavy child w of a heavy node, the data structure for S_w^2;
• for each leaf that contains a single point p ∈ S, that point p;
• for each leaf v with unit side, the closest point to v.

These pieces assemble into the recursive layout sketched below.
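One possible in-memory layout, with Python dicts standing in for the hash tables; the type names and field names are ours, not the paper's:

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple, Union

Point = Tuple[int, int]
Prefix = int                # node identifier: an interleaved prefix phi_i(v)

@dataclass
class Tree:
    """One subproblem of Lemma 4.3."""
    heavy_locator: object                    # the Lemma 4.2 parallel-hash structure
    heavy: Dict[Prefix, "HeavyEntry"] = field(default_factory=dict)

@dataclass
class HeavyEntry:
    """What is stored for each heavy node v."""
    # hash table of the non-heavy children w of v; a child is either the
    # S_w^2 subproblem or, at a leaf, the single/closest point to return
    children: Dict[Prefix, Union["Tree", Point]] = field(default_factory=dict)
    # the S_v^1 subproblem over the universe of sub-cells, or, when that
    # universe has side <= 2**(3 + 2*tau), a direct nearest-neighbor
    # lookup table for the central cell (a hash table of size 1/eps^2)
    s1: Union["Tree", Dict[Point, Point], None] = None
```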
To find the nearest neighbor to q ∈ [2^{k′}]², find the longest prefix v of q that is heavy, using the structure from Lemma 4.2. If q is contained in a child w of v, recurse on the data structure S_w^2. If there is no such w, recurse on the data structure S_v^1; this is easy, as the query is in interleaved form. Note that if S_v^1 has a sufficiently small universe, the search on S_v^1 is a hash-table look-up. If q is contained in a child w of v, and w is a leaf with unit side or a leaf containing a single point, then the single point stored at w is returned.

To show the running time, note that recursing through S_w^2 for some w reduces a by one. Recursing through S_v^1 for some v reduces a universe of size 2^{u^c} to one of size 2^{u^{c−1}} · 2^τ < 2^{u^{c−1}+τ+1}.
Arguing as in Section 3, after c reductions the query point has been located in a universe of size O(2^{3+2τ}), which is the base case and is answered in constant time.

The correctness of the algorithm is shown similarly to that of Section 3. The error introduced is on the order of the side length of a cell; any time error is introduced, the nearest neighbor is at least a factor of 1/ε farther away. Thus the arguments of that section hold here. Similarly, the accumulation of error through the tree is bounded by 1 + O(ε). The size bound comes from the argument in [4, Lemma 2]; however, as in the bound on heavy nodes, overlapping neighborhoods add a factor of s/ε² leaves.

4.4. The Complete Algorithm for Constant Dimension

We can use this structure for an algorithm that runs in universe-optimal time for constant dimension.

Theorem 4.1. In the unit-cost RAM model with Θ(log N) word size, there is a static data structure for representing n points from the universe [N]^d, for any constant d, that answers ε-approximate nearest neighbor queries in time O(log log N / log log log N) (the hidden constant depends on d). The size is polynomial in n and 1/ε^d.

Proof. Choose u = O(log log N / log log log N). Apply the stratified tree from Section 3 for O(3 log u + log d) = O(log u) levels to reduce the universe size, taking constant time per level. Each level cuts the number of bits in half, so that if the universe considered after the reduction is [N′]^d, we have u³ d log N′ = O(log N) bits per word, as required for Lemma 4.3. This runs in O(u) further time, so that the overall time is O(u + log u). The bound on the size of the data structure follows immediately from Lemmas 4.2 and 4.3.
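The word-size accounting in this proof can be sanity-checked numerically: halving the bits per coordinate log₂(u³d) = 3 log u + log d times makes u³ d log N′ fit in a Θ(log N)-bit word. A small check under these assumptions:

```python
import math

def levels_needed(N, d, u):
    """Section 3 levels to apply before Lemma 4.3's word bound holds:
    halve the bits per coordinate until u**3 * d * bits <= log2 N."""
    bits, levels = math.log2(N), 0
    while u ** 3 * d * bits > math.log2(N):
        bits /= 2                            # one stratified-tree level
        levels += 1
    return levels

# N = 2**64, d = 2, u = 4: u^3 * d = 128, so the bits must fall from 64
# to 1/2; levels_needed returns 7 = log2(128) = 3*log2(u) + log2(d),
# matching the O(3 log u + log d) count in the proof.
```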
Note that the running time is independent of ε, but that the space is proportional to 1/ε^d.

4.5. Extending to Arbitrary Dimension

If log d ≤ u, with u = log log N / log log log N, then the algorithm described in the previous section still goes through; however, the space used is unacceptably large, proportional to 1/ε^d ≤ (1/ε) · N^{1/ log log log N}, and not to some polynomial of n, as is usually desired.

In general, Lemma 4.2 is the bottleneck that restricts our algorithm to fixed dimensions. If a word size of d log N were feasible, the parallel hashing could be done in constant time; however, typically the word size is taken to depend only on log N. Indeed, as the parallel hashing uses multiplication, it is not even clear that the parallel hash could be done in O(d) time with O(log N)-sized words. Also, note that the leaf-level lookup table takes space (1/ε)^d. There is a similar factor in the space required by other ANN algorithms [1, 8]. If d is constant, this is just a polynomial amount of space and so matches the lower bound [4].

5. CONCLUSIONS

The recent results about ε-approximate nearest neighbors explored in this paper have all been upper bounds. The results generally involve a factor of 1/ε^d in their space and so do not quite fit into the framework of the lower bounds of [6] and [5], which assume space polynomial in n and d (also, [6] assumes a word size of O(d)). The lower bound for the predecessor problem [4] (see footnote 1) is most relevant to the algorithms we have presented here. The deterministic stratified tree comes within a factor of log log log N of optimal with respect to universe size, considering that O(d) time is required to read the input query. However, nothing nearly as tight has been found in terms of point-set size; the ring-cover algorithm [8] is the best known in
this regard but leaves a gap of log^{3/2} n / log log n to the lower bound.

These two issues point the way to future work on the ε-approximate nearest neighbor problem. A tight algorithm in terms of universe size is close. It would be a pleasant surprise if the one-dimensional predecessor lower bound is tight for higher dimensions, but the one-dimensional optimal algorithm would have to overcome the serious obstacle of fast multidimensional hashing. The alternative is to show that the stratified tree is optimal for higher dimensions, perhaps through a modification of the lower bound proof of Beame and Fich [4]. Note that the lower bound would have to be extended to ε-approximate nearest neighbor. All known efficient algorithms
for arbitrary-dimensional ε-approximate nearest neighbors use space proportional to 1/ε^d; it would be very interesting to find a lower bound for this amount of space. It would also be useful to get a lower bound in terms of ε for the approximate problem.

The problem of finding fast algorithms in terms of point-set size is much more wide open. To our knowledge, there is no analog of the fusion trees [7] and exponential search trees [2]. Such a result seems necessary before the upper bounds can be matched.

ACKNOWLEDGMENTS

We thank Paul Beame and Piotr Indyk for helpful conversations.
REFERENCES

1. A. Amir, A. Efrat, P. Indyk, and H. Samet, Efficient regular data structures and algorithms for location and proximity problems, in "Proceedings of the 40th Annual Symposium on Foundations of Computer Science, New York, October 17–19, 1999," pp. 160–170.
2. A. Andersson, Faster deterministic sorting and searching in linear space, in "37th Annual Symposium on Foundations of Computer Science, Burlington, VT, October 14–16, 1996," pp. 135–141, IEEE Press, New York.
3. S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu, An optimal algorithm for approximate nearest neighbor searching in fixed dimensions, J. Assoc. Comput. Mach. 45, No. 6 (1998), 891–923.
4. P. Beame and F. E. Fich, Optimal bounds for the predecessor problem, in "Proceedings of the Thirty-First Annual ACM Symposium on Theory of Computing, New York, May 1–4, 1999," pp. 295–304, Assoc. Comput. Mach., New York.
5. A. Borodin, R. Ostrovsky, and Y. Rabani, Lower bounds for high dimensional nearest neighbor search and related problems, in "Proceedings of the Thirty-First Annual ACM Symposium on Theory of Computing, New York, May 1–4, 1999," pp. 312–321, Assoc. Comput. Mach., New York.
6. A. Chakrabarti, B. Chazelle, B. Gum, and A. Lvov, A lower bound on the complexity of approximate nearest-neighbor searching on the Hamming cube, in "Proceedings of the Thirty-First Annual ACM Symposium on Theory of Computing, New York, May 1–4, 1999," pp. 305–311, Assoc. Comput. Mach., New York.
7. M. L. Fredman and D. E. Willard, Surpassing the information theoretic bound with fusion trees, J. Comput. System Sci. 47, No. 3 (1993), 424–436.
8. P. Indyk and R. Motwani, Approximate nearest neighbors: Towards removing the curse of dimensionality, in "Proceedings of the 30th Annual ACM Symposium on Theory of Computing (STOC-98), Dallas, TX, May 23–26, 1998," pp. 604–613, Assoc. Comput. Mach., New York.
9. J. M. Kleinberg, Two algorithms for nearest-neighbor search in high dimensions, in "Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, El Paso, TX, May 4–6, 1997," pp. 599–608.
10. E. Kushilevitz, R. Ostrovsky, and Y. Rabani, Efficient search for approximate nearest neighbor in high dimensional spaces, in "Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, Dallas, TX, May 23–26, 1998," pp. 614–623.
11. P. van Emde Boas, Preserving order in a forest in less than logarithmic time and linear space, Inform. Process. Lett. 6 (1977), 80–82.