Singleton Indexes for Nearest Neighbor Search

E.S. Tellez (a), G. Ruiz (b,1), E. Chavez (c,1)

(a) INFOTEC / Cátedra CONACyT, México
(b) Universidad Michoacana de San Nicolás de Hidalgo, México
(c) CICESE, México

Abstract

The nearest neighbor search problem is fundamental in computer science, and in spite of the effort of a vast number of research groups, the instances allowing an efficient solution are reduced to databases of objects of small intrinsic dimension. For intrinsically high-dimensional data, the only possible solution is to compromise and use approximate or probabilistic approaches. For the rest of the instances in the middle, there is an overwhelmingly large number of indexes with claimed good performance. However, the problem of parameter selection makes them unwieldy outside of the research community. Even if the indexes can be tuned correctly, either the number of operations for index construction and tuning is prohibitively large or there are obscure parameters to tune. Those restrictions force users from different fields to use brute force to solve the problem in real-world instances. In this manuscript, we present a family of indexing algorithms designed for end users. They require as input only the database, a query sample, and the amount of available space. Our building blocks are standard discarding rules, and the indexes add routing objects such as pivots, hyperplane references, or cluster centroids. These indexes are built incrementally and self-tune by greedily searching for a global optimum in performance. We experimentally show that with this oblivious strategy our indexes are able to outperform state-of-the-art, manually fine-tuned indexes. For example, our indexes are twice as fast as the fastest alternative (LC, EPT or VPT) for most of our datasets. In the case of LC, the fastest alternative for high-dimensional datasets, the difference is smaller than 5%; in the same case, our indexes are at least one order of magnitude faster to build. This superior performance is maintained for large, high-dimensional datasets (100 million 12-dimensional objects). In this benchmark, our best index is two times faster than the closest alternative (VPT), six times faster than the majority of indexes, and more than sixty times faster than the sequential scan.

Keywords: Nearest Neighbor Search, Auto-tuning Indexes, Metric Indexes, Pivot Selection

Email addresses: [email protected] (E.S. Tellez), [email protected] (G. Ruiz), [email protected] (E. Chavez)
(1) Partially funded by CONACyT grant 179795, Mexico.


1. Introduction

Nearest neighbor search is a pervasive problem in computer science. It appears in many applications such as textual and multimedia information retrieval, machine learning, streaming compression, lossless and lossy compression, bioinformatics, and biometric identification and authentication [10, 26], just to name a few. Some applications, e.g. multimedia databases, resort to intermediate representations such as vectors, sets, or strings of symbols. Those representations often produce intrinsically high-dimensional datasets, which in turn may lead to an exhaustive, sequential search at query time even when an index is used. This is because nearest neighbor searching is known to be exponentially difficult in the intrinsic dimension of the data, as reported in several places (e.g. [10] and [24]). In those situations, the only plausible solution is to use approximate or probabilistic methods, trading speed for the quality of the solution. Examples of approximate techniques are [27, 7, 13, 1, 16, 15].

We aim at tractable instances of datasets, where exact solutions are essential. There are applications where an approximate or a probabilistic approach cannot be used. Consider, for example, biometric identification: neither a miss (failing to identify the nearest neighbor of an object) nor a false claim (giving an output which is not the nearest neighbor) is acceptable, because both lead to a failure of the identification system. For this particular example, the only possible solution is a sequential scan over all the objects in the database, and the usual way to scale such a system is massive parallelism.

Exact proximity searching is also interesting from a purely academic perspective. It is hard to draw a line between tractable instances and those which only admit an approximate solution due to their high intrinsic dimensionality. One source of this ambiguity is the large number of potential solutions using indexes with claimed low complexity. If a practitioner looks for a solution, the efficiency claims of many papers could be misleading. We will analyze those factors from this practical perspective.

The most sensitive issue is the absence of a complexity model capable of capturing the behavior of an index in realistic circumstances. This limitation implies that indexes must be compared experimentally. Even in this setup there are two alternatives. The first one is to count the number of distance computations as the yardstick for index comparison; the rationale behind this choice is to consider the distance computation as the leading cost, which in turn should allow comparing different indexes over disparate datasets. One problem with this approach is that the intrinsic dimensionality of the data is a critical factor in the performance; hence the supposed independence from the dataset vanishes. The other alternative is to use standard benchmarks to compare all the indexes, using the average time spent on queries as the yardstick.

This method has the disadvantage of being unable to compare indexes belonging to different authors, implemented on different computer systems, and reported in different papers, without re-implementing everything each time. To avoid this disadvantage we normalize the total query time using a sequential scan as the reference. While this measure still hides many practical issues, like cache usage and the workload in a multi-user environment, it gives a better guide for practitioners.

An example of a cost hidden by the distance-counting model is a sequential scan over the data to filter. Clear examples are the AESA algorithm [28] for exact proximity searching and the Permutation based index [6] for approximate proximity searching. The combination of a relatively cheap distance function and a high internal cost can lead to a putatively fast index when counting distance computations that is slow in practice. An additional source of unfairness in the comparison of indexes is the memory usage and the preprocessing time for index construction and maintenance: some indexes are claimed to be competitive, but their construction cost and/or space overhead are prohibitive. Below we discuss the most competitive indexes in the literature, along with their possible shortcomings.

1.1. A Brief Survey of Exact Indexes

In AESA [28] the index consists of the O(n^2) distances among the objects of the database, stored in a table. For querying, an initial random pivot is selected and all the non-relevant objects are filtered out using the triangle inequality. From the remaining objects, the next pivot is selected close to the query using some cheaply computed distance. This process is repeated iteratively until only relevant objects remain in the collection; those remaining objects are the answer to the query. The claimed complexity of this method is a constant number of distance computations. However, it is necessary to compute a linear number of arithmetic and logical operations, along with quadratic preprocessing and storage costs. A restriction of the same idea is presented in LAESA [17], where a constant number of pivots is used; however, the claimed complexity at search time also increases.

Chavez et al. [10] proved that any pivot based metric index requires at least O(log n) random pivots, with n the size of the database. However, the base of the logarithm depends on the intrinsic dimension, so larger indexes are needed as the intrinsic dimension increases. Above a certain intrinsic dimensionality, the optimal number of pivots may not fit in main memory; hence the rule of thumb is to use as many pivots as fit. Proceeding in this way reduces the number of distances computed to solve a query, and it is useful for expensive distance functions. However, many of these indexes have a high internal cost, surpassing the cost of a sequential scan. Under this scheme, the selection of pivots is essential to reduce both the memory cost and the number of distances computed. To be of use in practice, the search time must be several times smaller than that of a sequential scan.

Pivot Selection Strategies. Since it is critical for the performance of pivot based indexes, a natural question is how to select good pivots.
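As an illustration of this family of methods, the pivot-table discarding rule (the one behind AESA, LAESA, and the pivot indexes discussed below) fits in a few lines of Python. This is only a sketch with illustrative names (dist, build_table, pivot_range_query) and L2 as an example metric; it is not the implementation of any of the cited indexes.

import math

def dist(a, b):
    # L2 distance; stands in for an arbitrary black-box metric d.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_table(db, pivots):
    # Preprocessing: store d(u, p) for every object u and pivot p.
    return [[dist(u, p) for p in pivots] for u in db]

def pivot_range_query(db, pivots, table, q, r):
    dqp = [dist(q, p) for p in pivots]   # internal cost: one distance per pivot
    result = []
    for i, u in enumerate(db):
        # |d(q, p) - d(u, p)| lower-bounds d(q, u) by the triangle inequality,
        # so if it exceeds r for some pivot, u cannot be in (q, r).
        if all(abs(dq - du) <= r for dq, du in zip(dqp, table[i])):
            if dist(q, u) <= r:          # external cost: verify the survivor
                result.append(u)
    return result

Note that the filtering loop itself touches all n table rows; this is exactly the internal cost hidden by the distance-counting model.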

A fair rule is to select the pivots randomly, but it is well known that the selection affects the performance of the search. Bustos et al. [2] introduced several pivot-selection strategies. The core of their contribution is a method to compare collections of pivot sets and decide which one performs better. The authors claim that a better set of pivots produces a distance distribution in the mapped space with a larger mean value. In particular, they propose an incremental selection strategy, which consists of taking a set of N candidate pivots and selecting the best one, say p1. Then, from another set of N candidates, p2 is selected such that {p1, p2} is the best pair of pivots among the candidates. This procedure is repeated until k pivots are selected. The technique needs to know the proper value of k in advance.

The Sparse Spatial Selection (SSS) [21] is a pivot based method that automatically determines and selects the number of essential pivots to be used by a pivot table. The SSS pivots are affected by the intrinsic dimension of the space and not by the size of the database. The procedure consists of fixing the maximum distance dmax between any two objects in the database; then the algorithm starts with an empty set and incrementally constructs the set of pivots as follows. At each step, a database object is tested for coverage by the current set of pivots; when an object is not well covered, it is promoted to pivot. More formally, an object becomes a new pivot if its distance to all the other pivots is greater than or equal to ε·dmax, for a fixed 0 < ε ≤ 1. The goal is to have a set of pivots well distributed over the space. The authors conclude, based on their experiments, that the best value for ε is 0.4, and postulate that this constant does not depend on the particular database.
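A minimal sketch of the SSS rule just described, assuming dist is the metric and dmax is known or estimated from a sample; the function name is ours, not part of [21]:

def sss_pivots(db, dist, dmax, eps=0.4):
    # An object is promoted to pivot when it lies at distance >= eps * dmax
    # from every pivot selected so far (i.e., it is not well covered).
    pivots = []
    for u in db:
        if all(dist(u, p) >= eps * dmax for p in pivots):
            pivots.append(u)
    return pivots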

The Spatial Approximation Tree (SAT) [18] is a metric tree that creates a compact partition of the data. The root a of the SAT is connected to its children N(a); the remaining elements of the database are assigned to the closest child of a. This process is repeated recursively with each child and its assigned elements. The set N(a) has the following two properties: for every u, v ∈ N(a), d(u, a) < d(u, v) and d(v, a) < d(u, v); and for every w not in N(a) there exists z ∈ N(a) such that d(z, w) < d(z, a). For a given element c, multiple valid N(c) can be built. One way is to take the closest element to c, call it x, and put it into N(c); then, all the points closer to x than to c cannot be in N(c) (second property), so they are discarded. From the rest of the elements, take the closest to c and repeat until all the points not in N(c) are discarded. This construction of N(c) produces the simple SAT. Another way to build N(c) is to take the farthest point y (instead of the closest) and put it in N(c); from the points not discarded by y, take the farthest and repeat. The SAT produced in this way is called the Distal SAT (DiSAT), recently described in [8]. Note that we can insert any valid element in N(c), discard the corresponding items, insert any other valid point, and repeat this process until all the points are either discarded or in N(c); inserting points randomly in N(c) produces the SATRand. The authors have found that the DiSAT is the best option. This index has good performance and no construction parameters (other than the order of the objects in N(c)); this simplicity of use makes the SAT a fair choice when there is not much knowledge about the database.

The List of Clusters (LC) [9] is also a compact partition index. For construction, an element p is selected and its m closest elements are assigned to it; they are called the cluster of p. This simple process is repeated, selecting the next element and building its cluster, until all the database objects are in some cluster. For each cluster, only the covering radius is stored in addition to the cluster elements themselves. When solving a query, all the clusters are checked sequentially, in the order of construction, to see if they may contain an answer; if the query ball is completely contained in a cluster, the search stops. The LC needs quadratic construction time for the useful combinations of parameters. Using O(log n) bits per item, this index performs the smallest number of distance computations on datasets of high intrinsic dimensionality, although the correct parameter combination is difficult to discover.

The Priority Vantage Points (KVP). Introduced by Celik in [3, 5], KVP is based on the observation that good pivots are those which filter most points at search time, and postulates that those pivots are either close to or far from the query. From a population of m pivots, each point is visible only to its 2k best pivots (the k closest and the k farthest). Only 2kn distances are stored, instead of the mn of a pivot table; these savings in space allow packing more pivots.

Vantage Point Tree (VPT). The VPT [29] is a binary tree where each element of the database is either a node or a leaf. It is constructed recursively from a random point p as the root: from all the remaining points, the median M of their distances to p is computed; the points x such that d(p, x) < M go to the left side and the rest go to the right side. The value M is stored. The process is repeated recursively on the left and right branches until reaching a leaf with just one element. The search for elements at distance r or less from q starts from the root p; if d(q, p) − r ≥ M the left side can be discarded, and if d(q, p) + r < M the right side cannot contain an answer. Note that sometimes both the left and the right sides need to be visited.

The Extreme Pivot Table (EPT). Presented in [23, 22], the EPT can be seen as a pivot selection method and also as a partition of the database. Each pivot of the EPT has some associated points of the database, the pivot regions, and the union of those regions is the whole database. The region associated with a pivot is formed by the objects most likely to be discarded by that pivot. The EPT is a sequence of layers called Pivot Groups (PG). Each PG is formed by disjoint regions (and their pivots) such that their union is the database. With this, each point is in exactly one region per PG, independently of the number of pivots. This is an advantage over KVP, because the EPT has many good pivots and each group uses exactly n distances. The space saved can be used to add more layers, improving the search time. Also, if some information about the database and the queries is known, the cost of the index can be predicted, allowing optimal parameter selection; if that information is unknown, it can be estimated by sampling the database. In this regard, EPT is unique among exact indexes: a competitive EPT can be obtained giving only the available memory as a parameter. Notice that the EPT is optimized to perform a small number of distance computations.

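To close this survey with something concrete, the following compact sketch shows a VPT with the median-split construction and the two pruning rules stated above. The node layout and the names are illustrative, not the implementation benchmarked later in this paper.

import random
import statistics

def build_vpt(points, dist):
    if not points:
        return None
    if len(points) == 1:
        return {"p": points[0], "M": None, "left": None, "right": None}
    p = random.choice(points)
    rest = [x for x in points if x is not p]
    M = statistics.median(dist(p, x) for x in rest)
    return {"p": p, "M": M,
            "left": build_vpt([x for x in rest if dist(p, x) < M], dist),
            "right": build_vpt([x for x in rest if dist(p, x) >= M], dist)}

def search_vpt(node, q, r, dist, out):
    if node is None:
        return
    dqp = dist(q, node["p"])
    if dqp <= r:
        out.append(node["p"])
    if node["M"] is None:
        return
    if dqp - r < node["M"]:        # the query ball may reach the inner (left) side
        search_vpt(node["left"], q, r, dist, out)
    if dqp + r >= node["M"]:       # ... and/or the outer (right) side
        search_vpt(node["right"], q, r, dist, out)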

1.2. Our Contribution

Our goal in this paper is to produce an index that is competitive in practice. We fix a cost function and request a memory bound and a sample of the query set; the output is the optimized index. In this sense, our approach can be considered a meta-heuristic for optimal index construction. However, instead of adapting existing indexes to our technique, we design a family of indexes fitting our construction scheme. We obtain indexes outperforming the state of the art on most of our datasets.

2. Some Formal Notation

A metric space is a tuple (U, d) where U is a domain and d : U × U → R is a distance function which obeys the following properties for all u, v, w ∈ U:

• Non-negativity: d(u, v) ≥ 0 and d(u, v) = 0 ⇐⇒ u = v.
• Symmetry: d(u, v) = d(v, u).
• Triangle inequality: d(u, v) ≤ d(u, w) + d(w, v).

We call the database (or dataset) a finite subset S ⊆ U, with n = |S| the number of elements of S. The proximity search problem consists of solving the following operations over the database:

• Nearest neighbor search, or simply nn_d(q), finds the closest item to q in S, i.e., nn_d(q) = arg min_{u∈S} d(u, q). This operation can be extended to retrieve the set of k nearest neighbors (k-nn).
• Range query (q, r)_d. This query retrieves the objects of S inside the ball of radius r centered on q, for q ∈ U, i.e., (q, r)_d = {u ∈ S | d(q, u) ≤ r}.

For simplicity, when the context is clear, we write (q, r) for (q, r)_d and nn(q) for nn_d(q). Our discussion is centered on nearest neighbor queries; simple modifications to the algorithms allow solving range and k-nearest neighbor queries as well.

It is clear that nn(q) can be answered by computing the distance from q to all the elements in the database. However, if the database is queried multiple times, and the number of elements is large or the distance function is costly to compute, it is advisable to preprocess the database and build an index, to obtain a smaller amortized time in the long run.
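Written directly from these definitions, the baseline operations are the sequential scans that every index must beat; dist stands for any function obeying the metric axioms (a sketch, not part of the original paper):

def nn(db, dist, q):
    # nn(q) = arg min over u in S of d(u, q)
    return min(db, key=lambda u: dist(u, q))

def range_query(db, dist, q, r):
    # (q, r)_d = { u in S | d(q, u) <= r }
    return [u for u in db if dist(q, u) <= r]

def knn(db, dist, q, k):
    # the k nearest neighbors, by sorting the whole database
    return sorted(db, key=lambda u: dist(u, q))[:k]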

It is accepted in the literature that the distance axioms are too weak to develop a complete complexity model for metric indexing. The total query time depends on an array of factors that makes parameter tuning difficult and cumbersome. In the absence of a complexity model covering all the relevant aspects, users of this technology have to consider several aspects to make a selection: i) the intrinsic dimensionality and the size of the dataset, ii) the query distribution and search radius, iii) the cost of the distance function, and iv) the underlying computer architecture. Those factors should be accounted for in a complexity model. The practitioner's ultimate request is a smaller (amortized) time for solving a large batch of m queries, as compared to m sequential scans of the database.

3. Auto-tuned Nearest Neighbor Index (ANNI)

We call an index able to optimize its parameters to best fit a given query sample, dataset, and fixed search cost function an auto-tuned index. Auto-tuned indexes are required to obey some properties; an auto-tuned index supports the following operations:

• It accepts incremental improvements.
• It is ready after each improvement, that is, it can solve queries.
• The search cost of queries can always be measured.

With the above considerations, the creation of an index is an optimization problem. The goal is an optimal index for a large set of unknown queries, having just a sample of the query set. Due to the high cost of construction, it makes sense to avoid backtracking in the index construction and to design an incremental process adding more routing objects at each step. We make a strong assumption backed by both practical and theoretical observations: we assume the total search cost is an essentially unimodal and convex function. Each step in the optimization takes O(n) distance operations. These remarks restrict the usable data structures for our index, as well as the selection of routing objects and filtering functions for the construction: the data structures must support incremental improvements.

A generic sketch of the auto-tuned index construction is as follows. We start with the database, a set of queries, and an empty index. One step consists of σ improvements. From the empty index we advance in steps of size σ, measuring the total search cost of the query sample, until it increases. At every step all the queries in the sample are used. This heuristic is sub-optimal even with the assumed convexity and unimodality, because we do not stop at the optimum, which lies between the second-to-last and the last step. In practice, adhering to this suboptimal procedure has a small impact on the total query time. It is worth noticing, however, that backtracking could in principle be done using persistent data structures, as described in [20].

Notice that the full process depends on the query set. If we have precise knowledge of the query distribution, we will end up with a finely tuned index. From a practitioner's point of view, having a query sample should not be a problem.


Figure 1: Behavior of the cost function. Internal and external costs are monotonic, increasing and decreasing respectively. The total cost is the sum of both. The diamond shows the location of the minimum total cost.

As a last resort, one can use a sample of the database instead; in the experimental section we use the latter approach. The search cost can be the total search time, the number of distances, or a combination of both. The construction of the index is discussed in detail later; for now, let us consider the effect of the step size (the number of improvements σ) on the search cost.

3.1. Measuring the Search Cost

The complexity model used in metric indexes splits the cost into two main terms: the internal and the external costs. The former refers to the cost of filtering the candidates relevant to a given query, while the latter is the cost of verifying the candidates and is proportional to the size of the candidate set. This is described in Equation 1:

cost = internal cost + external cost.   (1)

More filtering implies, on the one hand, a smaller candidate set and a correspondingly lower external cost. On the other hand, more filtering also implies a more precise cut point, increasing the internal cost. This trade-off is illustrated in Figure 1. While the external cost is a monotonically decreasing function, the internal cost is monotonically increasing. Our method is optimal on convex cost functions, which have the form depicted in Figure 1: we search for a local minimum, which coincides with the global minimum in these circumstances. In this case, the best choice is to set σ = 1.

When the cost function is the total search time, the shape is a convex, unimodal function plus noise. The noise comes from the fluctuations of a multi-user OS, the occupancy of the memory hierarchy, data locality, and in general some unpredictable small variations. A greedy approach would get stuck in a false global minimum because of the noise. This behavior is the reason for introducing σ > 1 as a parameter: we attempt to overcome local minima by making σ improvements at once. If we set σ to a relatively large value (e.g. 64), we can surpass the local minimum and continue the search for a faster instance.

We can even take the cost of the queries at each improvement and, after σ of them, compare their average with the average of the previous σ improvements.

Consider an Extreme Pivot Table [23, 22] that counts the number of distance evaluations as its cost; Equation 1 is then written as

cost = mℓ + n·s^ℓ,   (2)

where m is the number of pivots per group, ℓ is the number of groups, n is the size of the database, and s is the probability that a single pivot group fails to discard an object. Notice that s is a function of m, of the query itself, and of the characteristics of the database, such as its distribution and intrinsic dimensionality. Equation 2 is related to the cost of a traditional pivot index, whose expression is

cost = k + n·t^k,   (3)

where k is the number of pivots and t is the probability that a single pivot fails to discard an object (like s, t captures some of the complexity of both query and database). This expression governs indexes based on pivot tables. The complexity model given by Equations 2 and 3 is discussed in detail in [23, 22] for optimal values of m and ℓ and the proper definition of s; [10] makes the same analysis for k and t.

In this paper our goal is to provide a practical tool for practitioners. We estimate the cost online, using the query sample and the tools of the running system, and we care more for the total search time.
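As an illustration of Equation 3 and of the convex shape in Figure 1, the following sketch evaluates cost(k) = k + n·t^k and locates its minimum. The values of n and t below are invented for the example, not measured:

n, t = 1_000_000, 0.9   # assumed database size and per-pivot survival probability

def cost(k):
    # internal cost k (one distance per pivot) plus external cost n * t**k
    return k + n * t ** k

best_k = min(range(1, 500), key=cost)
print(best_k, cost(best_k))   # the minimum of the convex curve of Figure 1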


3.2. Auto-tuning Nearest Neighbor Indexes

Definition 3.1 (ANNI Index). Given a database S[1, n] = u_1, u_2, ..., u_n and a set of pivots P* ⊂ S of size m, an Auto-tuning Nearest Neighbor Index (ANNI) is composed of two arrays, P[1, n] and D[1, n]. These arrays are dynamic; their content varies during index construction and tuning. P and D are defined as follows:

• P[i] = piv(u_i), the pivot associated to u_i. Two technicalities: i) we use objects and indexes interchangeably to describe items, i.e., piv(u_i) ∈ [1..n]; ii) pivots have distinguished entries in P, i.e., P[i] = 0 ⇐⇒ u_i ∈ P*.
• D[i] = d(u_i, piv(u_i)). Notice that D[i] = 0 when P[i] = 0.
• P and D define a Dirichlet domain, i.e., for all u ∈ S \ P*, d(u, piv(u)) = min_{p∈P*} d(u, p).

A range search can be done using ANNI as a pivot table with one pivot per database element. For each element u of the database we check whether its pivot piv(u) can discard it; if it cannot, the distance between the query and u must be computed, and u may eventually be reported in the result. This is formalized in Algorithm 1.

Algorithm 1 Searching with an ANNI index.
Input: The database S[1, n] = u_1, ..., u_n, the ANNI index (P, D) (see Definition 3.1), and the query (q, r)
Output: The result set R satisfying the query
1: Let H(a) be a cache map storing the distance between q and a
2: Populate H with P*
3: for i = 1 to n do
4:   if D[i] = 0 then {u_i is a pivot; its distance to q is H(i)}
5:     if H(i) ≤ r then
6:       R ← R ∪ {(H(i), S[i])}
7:     end if
8:   else
9:     Define d_pq = H(P[i]) {the distance between the covering pivot and the query}
10:    if |d_pq − D[i]| ≤ r then
11:      Define d_iq = d(q, S[i])
12:      if d_iq ≤ r then
13:        R ← R ∪ {(d_iq, S[i])}
14:      end if
15:    end if
16:  end if
17: end for

Comments on solving k nearest neighbor queries. It is not difficult to adapt this algorithm to support k nearest neighbor queries: the idea is to iteratively bound r, using a priority queue of fixed size for R. In particular, the D[i] = 0 condition and line 2 can be adapted to bound r quickly. R is a set of tuples instead of plain objects to clarify the usage of priority queues.
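A direct transcription of Algorithm 1 into Python may help; the 1-based layout mirrors Definition 3.1 (S[0] is unused, P[i] = 0 marks a pivot) and dist is the black-box metric. This is a sketch, not the natix implementation:

def anni_range_search(S, P, D, dist, q, r):
    n = len(S) - 1
    # cache H: distance from q to every pivot, keyed by the pivot's index
    H = {i: dist(q, S[i]) for i in range(1, n + 1) if P[i] == 0}
    R = []
    for i in range(1, n + 1):
        if D[i] == 0:                      # u_i is a pivot; its distance is cached
            if H[i] <= r:
                R.append((H[i], S[i]))
        elif abs(H[P[i]] - D[i]) <= r:     # the covering pivot could not discard u_i
            d_iq = dist(q, S[i])
            if d_iq <= r:
                R.append((d_iq, S[i]))
    return R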

Definition 3.2 (NANNI). A Node Auto-tuning Nearest Neighbor Index (NANNI) is an ANNI implemented with clusters: each database object has an associated cluster center, or pivot. Given a database S[1, n] = u_1, u_2, ..., u_n and P* = p_1, p_2, ..., p_m, a NANNI index is composed of two functions, items and cov:

• items(p) = {(u, d(u, piv(u))) | piv(u) = p}.
• cov(p) = max_{(u, d_u) ∈ items(p)} d_u; this is called the covering radius of p.
• Like ANNI, NANNI also defines a Dirichlet domain.
• The search algorithm is similar to ANNI's; however, since we have clusters, we have the following modifications:

i) The search algorithm is modified to prioritize pivots closer to the query q. ii) For each pivot p, its region can be safely discarded when d(q, p) > cov(p) + r (ball condition), see Figure 2. iii) Also, the hyperplane condition can help to discard a region when d(q, p) > d_nearest-p + 2r, where d_nearest-p is the distance from q to its closest pivot, see Figure 3.

• It can be constructed in two ways: 1. by processing a working ANNI (Algorithm 4), fusing its data into clusters to obtain items and cov efficiently; 2. or iteratively, with a construction algorithm that randomly promotes objects to pivots, where each new pivot steals objects from other pivots to maintain a valid Dirichlet domain (Algorithm 2).

A NANNI where the cost function is the time will be called TNANNI.

Algorithm 2 The construction of NANNI.
Input: The database S[1, n] = u_1, ..., u_n, the set of training queries Q, the length of the steps σ, and the cost function f to minimize.
Output: The new NANNI (P*, items, cov).
1: Set prev_cost ← n + 1 and curr_cost ← n.
2: Set P* = {p_1} for some random object p_1.
3: Populate items(p_1) with all the objects of S.
4: Compute cov(p_1).
5: while curr_cost ≤ prev_cost do
6:   prev_cost ← curr_cost
7:   Add σ items to P* and rearrange all the objects in items(P*) to maintain a Dirichlet domain. Also update all the cov's.
8:   Search for all the queries in Q and evaluate the cost function f.
9:   Store the average cost of the queries in curr_cost.
10: end while
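The outer loop shared by Algorithm 2 (and by Algorithms 4 and 5 below) fits in a few lines. Here index.improve() and search_cost are hypothetical hooks standing for one pivot promotion and for the chosen cost function (number of distances or elapsed time); the stopping rule is the one described in Section 3:

def autotune(index, Q, sigma, search_cost):
    avg_cost = lambda: sum(search_cost(index, q) for q in Q) / len(Q)
    curr = avg_cost()              # cost of the (nearly) empty index
    while True:
        prev = curr
        for _ in range(sigma):     # one step = sigma improvements
            index.improve()
        curr = avg_cost()
        if curr > prev:            # one step past the minimum: stop
            break
    return index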

Definition 3.3 (MANNI). A Multiple-ANNI (MANNI) is defined as a collection of ℓ ANNI's, M = I_1, ..., I_ℓ. In particular, I_1 is called the leader and is a NANNI index; the rest are plain ANNI's.

– Each one of the ℓ ANNI's is constructed independently, using Algorithm 4 reporting ℓ instances. The leader is then converted to a NANNI index.
– Algorithm 3 describes the searching procedure, coordinating both the NANNI and the ANNI indexes.



Figure 2: If the covering ball of a node p does not intersect the query ball then all points of this node can be discarded.

Figure 3: The d(q, p) > d(q, p′) + 2r criterion (where p′ is the closest node to q) says that if the query is on the right side of the hyperbola, the points of p can be discarded.

Figure 4: A MANNI index. The top figure corresponds to the NANNI and the bottom matrix illustrates the ℓ − 1 ANNI's. Notice that piv is parametrized by the ANNI number, denoting the independence among the sets of pivots.


– Since MANNI is composed of several indexes, and those indexes have functions and internal structures, when necessary we add a subscript to the names to identify the target index; for example, for indexes A and B our notation becomes items_A, cov_B, D_A, P_B, etc.

The organization of MANNI is depicted in Figure 4. Notice that items(p) does not store actual items, just a list of references to them. From a basic point of view, the set of pivots together with the set of all items(p) defines a permutation of the dataset. The same is true for each ANNI (there, the permutation is the identity).

Algorithm 3 Searching with a MANNI index.
Input: The database S[1, n] = u_1, ..., u_n, the MANNI index M = I_1, ..., I_ℓ (see Definition 3.3), and the query (q, r)
Output: The result set R satisfying the query
1: Let H(a) be a cache storing distances between q and a
2: Populate H with the distances from q to all pivots in I_1, ..., I_ℓ
3: Select P* from I_1 and order its pivots with respect to their distance to q
4: Compute d_nearest-p as the smallest distance from a pivot in P* to q
5: for p ∈ P* do
6:   Define d_pq = H(p)
7:   if d_pq ≤ cov_I1(p) + r and d_pq ≤ d_nearest-p + 2r then
8:     for (i, d_pi) ∈ items_I1(p) do
9:       if |d_pq − d_pi| ≤ r then
10:        for j = 2 to ℓ do
11:          Try to prove in I_j that d(q, S[i]) > r using the same rules as Algorithm 1 {this does not evaluate d(q, S[i])}
12:        end for
13:        if there is no proof that d(q, S[i]) > r then
14:          Define d_iq = d(q, S[i])
15:          if d_iq ≤ r then
16:            R ← R ∪ {(d_iq, S[i])}
17:          end if
18:        end if
19:      end if
20:    end for
21:  end if
22: end for

Comments on solving k nearest neighbor queries. The strategy explained for Algorithm 1 can be applied; notice that bounding r after line 2 is crucial.

Definition 3.4 (DMANNI and TMANNI Indexes). DMANNI is a MANNI index designed to reduce the number of distance computations.

Algorithm 4 Construction of an ANNI Index
Input: The database S[1, n] = u_1, ..., u_n, a training query set Q[1, m], the number of sister indexes ℓ, the step length σ
Output: The new ANNI index (P[1, n], D[1, n])
1: Let P[1, n] be the table where we store the index of the pivot associated to u_i, i.e., piv(u_i) = u_{P[i]}
2: Let D[1, n] be the table for storing d(u_i, piv(u_i))
3: Let prev_cost ← n + 1 and curr_cost ← n
4: Let H(a, b) be a cache map storing the distance between a and b; if the pair (a, b) is undefined then it computes d(a, b) and adds the new pair and its distance to H
5: Initialize P and D using a pivot randomly selected from S
6: while prev_cost > curr_cost do
7:   prev_cost ← curr_cost
8:   for i = 1 to σ do
9:     Randomly select a pivot id c from [1..n]
10:    P[c] ← 0, D[c] ← 0 {tag c as pivot}
11:    for j = 1 to n do
12:      if P[j] > 0 and H(S[c], S[P[j]]) ≤ 2D[j] then {otherwise the stealing condition (Figure 5) cannot apply to S[j]}
13:        Define d_j = d(S[c], S[j])
14:        if d_j < D[j] then
15:          D[j] ← d_j
16:          P[j] ← c
17:        end if
18:      end if
19:    end for
20:  end for
21:  cost ← average cost of searching all q ∈ Q using Algorithm 1
22:  Empirically compute γ, the probability that this index fails to discard an object; recall cost = m + nγ, where m is the number of distinct pivots in P, i.e., γ = (cost − m)/n
23:  Adjust the estimated cost for ℓ identical instances of the ANNI index created under i.i.d.r.v. assumptions, i.e., curr_cost ← mℓ + nγ^ℓ
24: end while
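The pivot-promotion step of Algorithm 4 can be sketched as follows. Here pivot_dist(a, b) is a hypothetical memoized d(S[a], S[b]) playing the role of the cache H, and S, P, D follow the 1-based layout of Definition 3.1:

def promote(S, P, D, dist, c, pivot_dist):
    P[c], D[c] = 0, 0.0                      # tag c as a pivot
    for j in range(1, len(S)):
        if P[j] == 0:                        # pivots, including c, stay put
            continue
        # If d(c, piv(u_j)) > 2 D[j], the triangle inequality gives
        # d(c, u_j) >= d(c, piv(u_j)) - D[j] > D[j], so c cannot steal u_j.
        if pivot_dist(c, P[j]) > 2 * D[j]:
            continue
        d_cj = dist(S[c], S[j])
        if d_cj < D[j]:                      # c is closer: steal u_j from its pivot
            D[j] = d_cj
            P[j] = c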



(a) Stealing condition

(b) Reconfiguration of the Dirichlet domain after c′ is promoted as pivot.

Figure 5: When c′ is promoted as a new pivot, it defines a new region. To maintain the properties of ANNI, all items located in the new region must be associated with c′; thus, c′ must steal u from c when d(c, u) > d(c′, u). The left figure illustrates the case denying the stealing of u. On the right, an example of the transformation of the underlying Dirichlet domain after c′ becomes a pivot: all items in the gray region must be associated with c′. The example lives in a two-dimensional space under the Euclidean distance.

Algorithm 5 Auto-tuning of a DMANNI Index
Input: The database S[1, n] = u_1, ..., u_n, a training query set Q[1, m], the number of indexes ℓ, the step length σ
Output: The new DMANNI index as a collection of one NANNI and ℓ − 1 ANNI instances
1: Create one empty NANNI and ℓ − 1 empty ANNI's.
2: Let prev_cost ← n + 1 and curr_cost ← n
3: while prev_cost > curr_cost do
4:   prev_cost ← curr_cost
5:   for i = 1 to σ do
6:     for I ∈ DMANNI do
7:       Promote an item to pivot and rearrange objects to maintain a Dirichlet domain: i) for the NANNI this means applying the pivot-stealing procedure described in Definition 3.2; ii) an ANNI applies the procedure starting at line 9 of Algorithm 4.
8:     end for
9:   end for
10:  Search all q ∈ Q using Algorithm 3 and store the cost.
11:  curr_cost ← average of the cost of the searches.
12: end while


Like MANNI, it is composed of a collection of one NANNI and some ANNI instances. A DMANNI works as follows:

– The search algorithm is the same as MANNI's.
– The construction is described in Algorithm 5. Notice how pivots are promoted in all instances before measuring the search cost; this contrasts with the independent construction and estimation policy of MANNI.

TMANNI is identical to DMANNI, but it optimizes the total search time instead of the number of distance computations. It is a practical way to incorporate all the hidden costs of the operating system and the index's internal complexity, and it is straightforward to adapt Algorithm 5 to measure the cost as real time.

3.3. Analysis of the Search Cost

The ANNI uses the separation into classes of the Dirichlet domain and the verification rule of the pivot methods, i.e., a point u is discarded for a query q with radius r when |d(u, c_u) − d(q, c_u)| > r, where c_u is the center of the class of u. It has a cost of n·S_A + m, where n is the size of the database, m is the number of centers, and S_A is the probability that the ANNI fails to discard a point. We now analyze S_A. Let X_{c_i} be the random variable X_{c_i}(u) = d(c_i, u) for u in the class of the center c_i, and let Y_{c_i}(q) = d(c_i, q). We assume that X_{c_i} and Y_{c_i} are independent identically distributed random variables (i.i.d.r.v.). If X_{c_u} is the random variable of the class of u, then

S_A = Pr(u is not discarded) = Pr(|X_{c_u}(u) − Y_{c_u}(q)| ≤ r).

This probability is similar to that of the pivot based algorithms analyzed in [10]; the difference is that X_{c_i} is defined only for the elements associated to c_i, so the elements of a class can only be discarded by their own center. Since the pivot covering an object is not random (they are close to each other), the probability of discarding a covered object is larger than with a random pivot; on the other hand, there is only one pivot available to discard each object. This is illustrated in the example of Figure 6. At first sight, this would make ANNI less competitive than a pivot table; however, we can use more pivots, as MANNI does. It is well known that pivots near and far from a query are the most likely to discard objects [3, 4, 23]; in our case, we store just the near pivots. The whole picture is that both the external cost and the total number of evaluated distances are reduced, at a minimum increase of the internal cost.

The NANNI is similar, but has two extra conditions. First, we check whether d(c_i, q) > cov(c_i) + r, where cov(c_i) is the maximum distance from the center c_i to the elements of its cell; if the expression is true, we can discard the entire cell of c_i. If not, we verify whether d(c_q, q) < d(q, c_i) − 2r, where c_q is the center closest to q, to see if we can discard the cell of c_i.


Figure 6: If a center cannot discard a group of points, no other center can and those points must be manually checked.

Finally, if both tests fail, we use the ANNI criterion. Note that the latter condition implies the first, and hence the first condition does not affect the probability of discarding. So we have

S_N = Pr(u is not discarded) = Pr(Y_{c_q}(q) ≥ Y_{c_u}(q) − 2r ∩ |X_{c_u}(u) − Y_{c_u}(q)| ≤ r),

which gives the NANNI a smaller external cost.

The MANNI is a combination of a NANNI and multiple ANNI's; its cost is n(S_N · S_A^{ℓ−1}) + ℓm. Remember that the ANNI and the NANNI (and in consequence the MANNI) have the properties required by the auto-tuning method. In the case of the DMANNI (and TMANNI), we could auto-tune the NANNI and each ANNI individually, but we see the DMANNI as a whole: we want the auto-tuning method to be a sort of meta-heuristic that uses the indexes as black boxes. We essentially add indexes until the performance drops.

4. Experimental Results

Our experimental setup is standard for testing indexes: a common dataset is used for all the indexes, and the algorithms run on the same machine, implemented in the same language. The first part corresponds to real world datasets; the second part corresponds to random vectors in the unit cube, used to test the dependency on the dimension and the database size.

4.1. Description of the Experiments

We used standard, real world databases (Nasa, Colors, and the English Wiktionary) as well as randomly generated databases (RVEC) of several dimensions and sizes. Next we describe them.

— Nasa. This database is a collection of 40,150 vectors of 20 coordinates, obtained from the SISAP project (http://www.sisap.org).

It uses L2 as the distance function. A sequential search completes in 0.0140 seconds on our testing machine.

— Colors. The second benchmark is a set of 112,682 color histograms (112-dimensional vectors) from SISAP, under the L2 distance. A sequential search takes 0.165 seconds.

— Wiktionary. The third benchmark is the English Wiktionary, a dictionary with 736,639 entries, using Levenshtein's distance as the metric. A sequential search completes in 0.940 seconds.

— RVEC. On the one hand, to measure the effect of a varying dimension, we used one-million-object datasets of 4, 8, 12, 16, 20 and 24 dimensions. A query is solved by exhaustive search in 0.078, 0.096, 0.113, 0.134, 0.150, and 0.170 seconds, respectively. On the other hand, the performance as the size increases is measured using seven 12-dimensional datasets containing 10^5, 3×10^5, 10^6, 3×10^6, 10^7, 3×10^7 and 10^8 items. The times for exhaustive search over these collections are 0.011, 0.034, 0.113, 0.338, 1.134, 3.411 and 11.277 seconds, respectively. All these datasets use the L2 distance.

We used the distance function as a black box, without using the coordinate information; this is also standard in benchmarking metric indexes. Each plot depicts the average of 256 nearest neighbor queries, and query objects were not indexed. Along with our contributions (NANNI, TNANNI, MANNI, DMANNI, and TMANNI) we used several state of the art metric indexes, as well as canonical representatives of the three known discarding rules (pivots, Dirichlet domains, and clusters), as baselines for the comparison.

Indexes Compared.
1. Sequential or exhaustive scan, to bound the searching time when the dimension is large.
2. LAESA, the standard pivot table.
3. List of Clusters (LC) [9], which holds the best performance at equal memory (using the correct setup). The LC cannot improve its performance by adding more memory.
4. The fourth baseline consists of two versions of the SAT: the legacy SAT by Navarro [18] and the Distal SAT [8]. The SAT is a well known parameterless metric index.
5. A fifth baseline is the incremental selection of pivots, BNC-Inc [2].
6. Spatial Selection of Sparse Pivots (SSS) [21].
7. K Vantage Pivots (KVP) [5].
8. EPT [23, 22], a fast and small index, using the memory as its only parameter.

9. VPT [29], a well known parameterless index.

In all cases, our ANNI indexes were created using 64 random elements of the database as the training query set, fixing the step σ to 128 for NANNI and TNANNI (see Algorithm 4), and to 512, as the sum of all steps over the individual instances, for MANNI, DMANNI, and TMANNI (see Algorithm 5). The algorithms were implemented in C# with the Mono framework (http://www.mono-project.org). Algorithms and indexes are available as open source software in the natix library (http://github.com/sadit/natix/). All experiments were executed on a 16-core Intel Xeon 2.40 GHz workstation with 32GB of RAM, running CentOS 5.5, without exploiting the multicore architecture in the search experiments.

4.2. Index Comparison

The comparison method consists of contrasting all the indexes on a single dataset and iterating over several representative datasets. We show experimentally that our self-optimized indexes are faster than the state of the art alternatives. Incidentally, we help dispel the notion that reducing the number of distance computations always leads to faster indexes, showing experimental evidence that this holds only when the internal complexity of the index is low.

Colors. We start with the dataset Colors. Figure 7(a) shows on the x-axis the size of the index in memory, and on the y-axis the number of distances computed as a fraction of the database size (number of distances / elements of the database). Here the LC, NANNI and TNANNI have the smallest memory footprint; they also compute more distances than most of the other indexes. Considering only this measure would lead to the misleading conclusion that those other indexes are better. However, when comparing these results with the plot in Figure 7(b) (size of the index vs. speedup), the NANNI and TNANNI stand apart, being faster than most of the other indexes. This is one of the cases where a "bad" index in terms of the number of distances computed is the fastest. As mentioned, this is due to various factors that depend on the metric space and/or the index itself; we should not focus on just how many distances an index computes, and performance metrics should be more inclusive. Comparing just the number of computed distances is a necessary abstraction for a more comprehensive analysis, and it has its theoretical importance. However, the inner complexity of some indexes (requiring, for example, a sequential scan of the database with a cheaper filtering procedure) makes them unsuitable from the perspective of a practitioner. One extreme example is SSS: it computes a very small number of distances, but it is the slowest index and also the one that needs the most space. This indicates that it has many pivots that do not help to discard elements; only a few of its pivots are really needed.


[Plots omitted. Two panels; x-axis: memory, y-axes: # distances / n and speedup; series: LAESA, SSS, BNC, DiSAT, SAT, EPT, KVP, VPT, LC, DMANNI, TMANNI, NANNI, TNANNI, MANNI.]

(a) Number of computed distances, bottom-left corner is better. (b) Speedup as compared with sequential search, top-left corner is better.

Figure 7: Performance comparison among our indexes and the state of the art alternatives over the Colors database.

Compare the SSS with the EPT, for example: some EPT instances compute the same number of distances but are smaller and faster, which means the EPT has some key pivots boosting the searches. More generally, in a pivot based method, less space and a larger speedup imply better pivots; hence, the efficiency of NANNI and TNANNI implies they have good pivots, in the proper number.

From the total query time results, we conclude that the indexes MANNI, TMANNI, and DMANNI have similar performance. As the amount of memory increases, the searches become slower even though the number of distances computed decreases; this is because the indexes have too many pivots, and it happens very early in this database because of its small size. This behavior is analyzed in more detail on the larger databases below. It is also worth noting that, at the same memory usage, our methods are very competitive: they compute about the same number of distances as EPT and KVP, while being faster most of the time. This becomes apparent in the speedup plot (Figure 7(b)). The LAESA, SSS, and BNC indexes are left behind in this and all the other tests. The SAT's only advantage is space: it needs less memory than most of the others (but more than the NANNI and TNANNI). The space advantage becomes a drawback because there is no mechanism allowing the SAT to use more space. The DiSAT computes fewer distances than the simple SAT and is faster, but even the DiSAT ranks below our indexes in this and the other experiments. The VPT needs to compute many distances, and its speedup is low in this and all the tests except with the largest dataset.

The best indexes for Colors (Figure 7) are DMANNI and EPT, using from 2 to 4 groups; their speedups go from 7 to more than 8.6. Other small indexes like NANNI, TNANNI, VPT and LC also perform well: the first two show a speedup higher than 7, while VPT and LC have smaller speedups, yet both outperform several indexes using more memory.

Nasa. The format of the experiment is the same as before; the results are summarized in Figure 8. The fastest indexes for the Nasa database are TMANNI, NANNI, and TNANNI, with speedups from 8.9 to 9.7.


[Plots omitted. Two panels; x-axis: memory, y-axes: # distances / n and speedup; series: LAESA, SSS, BNC, DiSAT, SAT, EPT, KVP, VPT, LC, DMANNI, TMANNI, NANNI, TNANNI, MANNI.]

(a) Number of computed distances, bottom-left corner is better. (b) Speedup as compared with sequential search, top-left corner is better.

Figure 8: Performance comparison among our indexes and the state of the art alternatives over the Nasa database.

In fact, the best indexes are among the ANNI family, followed by the LC, VPT, and DiSAT. Here the smallest indexes perform better, since the database is small: using more pivots than necessary gives worse search times even though the number of distances steadily decreases. Note how the ANNI family is always faster, even against indexes using more memory; this shows the efficiency of the structure and how well it adjusts to the instance.

Wiktionary. In Figure 9, the NANNI is small, computes just a few distances, and is very fast. In Figure 9(b) the DMANNI and TMANNI show a similar tendency, with the DMANNI being faster. The MANNI is a bit different because its construction method assumes i.i.d.r.v. data; it reaches its top speed with small memory. The only index comparable to them is the EPT. Here the VPT is especially affected in total search time because of the large number of costly Levenshtein distances. This dataset is 6.5 times larger than Colors; notice also that the edit distance is quadratic, hence relatively costly. Here the top-10 fastest indexes are the LC at the head (12 times faster than the sequential search) and DMANNI-8 and MANNI-2, with speedups higher than 11. EPT with 4, 8, and 16 groups reaches speedups up to 10.6, while NANNI is 9.7 times faster than the sequential search. Note how the smaller indexes are still good, but now some larger indexes perform better; this is because of the cost of the distance function. Please notice that our singleton indexes are among the fastest. The LC is the fastest and uses the smallest amount of memory (as discussed, this result should be taken with caution, since the tuning of the LC and its construction time make it impractical; see Table 1 in Section 4.4.3). Disregarding the memory usage, TMANNI and DMANNI are the fastest, followed closely by the EPT.

Random Vectors, 12 coordinates. In this database the objective is to test an instance of known dimension, in this case a 12-dimensional dataset. We can observe in Figure 10 that most of the indexes are comparable in the number of computed distances, with NANNI and TNANNI unremarkable there.

[Plots omitted. Two panels; x-axis: memory, y-axes: # distances / n and speedup; series: LAESA, SSS, BNC, DiSAT, SAT, EPT, KVP, VPT, LC, DMANNI, TMANNI, NANNI, TNANNI, MANNI.]

(a) Number of computed distances, bottom-left corner is better. (b) Speedup as compared with sequential search, top-left corner is better.

Figure 9: Performance comparison among our indexes and the state of the art alternatives over the Wiktionary database.

In contrast, for the speedup, in Figure 10(b) we can see that the NANNI and TNANNI are remarkable, and DMANNI is the fastest compared with the others using the same amount of memory. The only competition at large memory usage is the EPT, which keeps a consistent performance. This is not a surprise, because both methods aim to construct an optimal index; the advantage of the ANNI is its practical construction algorithm.

The distance function in this benchmark (L2) is relatively fast. Hence, this benchmark represents the case of cheap distances with a large intrinsic dimensionality. This is a hard case for traditional metric indexes, since most of them are designed to work with costly distance functions. Figure 10 shows how most indexes barely improve over the sequential search. In this case, the smaller indexes excel in performance. Both VPT and DiSAT are the exceptions, perhaps due to the intrinsic dimensionality of the dataset, but NANNI and TNANNI have the best performance. All the members of the MANNI family with two indexes also perform well. EPT and KVP have limited performance, yet they work relatively well with low memory. Larger indexes perform well in terms of distances, even if they are not an option for practical use with low-cost distances.

Summary. In all the above experiments, we can see that indexes designed to use low memory are consistently faster. Indexes using many pivots perform few distance computations per query; however, not all indexes computing few distances to solve a query are good in practice. See for example BNC and SSS: they perform badly in terms of real time. We postulate a more precise rule for creating good indexes: reduce the number of distances and use a simple indexing machinery. Notice that those objectives are contradictory, and also notice that the overall performance depends on the particular distance function, database, and query set. Our approach accounts for all those parameters and balances them for the given workload; this is the main reason behind the good performance of our indexes.

22

DiSAT SAT EPT

KVP VPT LC

DMANNI TMANNI NANNI

LAESA SSS BNC

TNANNI MANNI

0.60

12.00

0.50

10.00

0.40

8.00

speedup

# distances / n

LAESA SSS BNC

0.30 0.20 0.10 0.00 1000

DiSAT SAT EPT

KVP VPT LC

DMANNI TMANNI NANNI

TNANNI MANNI

6.00 4.00 2.00

10000

100000

1e+06

1e+07

memory

0.00 1000

10000

100000

1e+06

1e+07

memory

(a) Number of computed distances, bottom-left (b) Speedup as compared with sequential corner is better search, top-left corner is better

Figure 10: Performance comparison among our indexes and the state of the art alternatives over the RVEC-12-1M database.

4.3. Performance when Dimensionality Grows

The curse of dimensionality is a well known phenomenon limiting the performance of metric indexes. As the dimension increases, the relative gain in speed is shattered, and it is even possible for an indexed search to become more costly than a sequential search. We fixed the database size to one million objects and varied the dimension. The databases were randomly generated in the unit cube, with dimensions 4, 8, 12, 16, 20, and 24. The results are shown in Figures 11 and 12, grouped by space usage.

First, for the smallest indexes, we let them take 10 or fewer integers per element in the dataset (Figure 11(a)); in this setup the best choices are NANNI and TNANNI. Next, we increased the size to allow the indexes between 10 and 30 integers per item (Figure 11(b)); here the MANNI family is faster. Finally, allowing between 30 and 100 integers we have the second largest instances (Figure 11(c)), and allowing up to 300 integers per item the largest ones (Figure 11(d)). The result is the same: the MANNI family has a clear dominance in small dimensions. Here our empirical construction takes a big advantage because of the nature of the dataset: since our training queries are similar to the actual queries, the resulting index is very close to the optimal one. For large dimensions, all the methods collapse and are equally bad; this is a known effect of the curse of dimensionality.

The set of plots in Figure 12 shows the number of distances, grouped in the same way as before; here we can see what happens with the number of distances as the dimension increases. Figure 12(a) shows that the ANNI indexes have the smallest increase. In Figures 12(b) and 12(c) we see that the EPT is the best, closely followed by DMANNI and MANNI. In Figure 12(d) we note that the DMANNI surpasses the EPT.

Summary. In this set of experiments, we fixed the size of the dataset and varied the dimension. For this we used synthetic vectors under L2; thus the distance function is cheap.

23

100 90 80 70 60 50 40 30 20 10 0

SAT EPT VPT

LC DMANNI TMANNI

LAESA SSS BNC

NANNI TNANNI MANNI

EPT KVP DMANNI

TMANNI MANNI

60 50 speedup

speedup

LAESA BNC DiSAT

40 30 20 10

4

8

12

16

20

0

24

4

8

12

dimension

(a) up to 10 integers LAESA SSS BNC

EPT KVP DMANNI

LAESA SSS BNC

TMANNI MANNI

20

speedup

speedup

25 15 10 5 4

8

12

20

24

20

24

(b) 10 to 30 integers

30

0

16 dimension

16

20

24

dimension

20 18 16 14 12 10 8 6 4 2 0

4

EPT KVP DMANNI

8

TMANNI MANNI

12

16 dimension

(c) 30 to 100 integers

(d) 100 to 300 integers

Figure 11: Search speedup for increasing dimensionality, higher is better. The figure shows performance for indexes using up to 300 integers per item. One million RVEC datasets.


(Figure 12 panels plot # distances / n against dimension for the competing indexes. (a) up to 10 integers; (b) 10 to 30 integers; (c) 30 to 100 integers; (d) 100 to 300 integers.)

Figure 12: Search cost measured as the number of distance evaluations for increasing dimensionality (RVEC-*-1M), lower is better. The figure shows performance for indexes using up to 300 integers.


Note that in dimensions 4, 8, and 12, our indexes are the fastest (Figure 11). In the remaining dimensions our indexes are among the best, although the differences are less noticeable. It is interesting to notice how, in low dimensions, DiSAT is very competitive; however, it does not sustain its performance as other small options like NANNI or TNANNI do. The reason is that our indexes adjust their parameters to the dataset, while DiSAT (or even VPT) are truly parameterless. From the point of view of an end user, an index without parameters and a self-adjusting index are similar. This is not the case for LC, which needs a lot of computational effort to obtain a good setup, along with some minimal skills from the end user. However, a properly configured LC is very competitive.

The results in Figure 12 show that memory usage is an important resource: the larger the allowed memory, the faster the index. We must say that this advantage translates into small real times only when the distance function is really costly. The best performing indexes in terms of distance computations are the MANNI family along with the EPT. Here, the performance of KVP and SSS is good too. The LC, as before, is very competitive.

4.4. Performance when n Grows

An interesting case is when all the parameters but the size of the dataset are fixed. The next set of results shows the behavior of the indexes for increasing database size. The exact sizes used were 10^5, 3 × 10^5, 10^6, 3 × 10^6, 10^7, 3 × 10^7, and 10^8.

4.4.1. Medium size databases

The results are grouped, as before, by the amount of space of the indexes for the first four sizes of the datasets. In Figures 13(a) and 14(a) we have the speedup and the number of distances for the smallest indexes. We can see that NANNI is the fastest and all our indexes have a much better speedup than the others; they are also quite competitive in computed distances. In Figures 13(b) and 14(b), DMANNI and TMANNI are the fastest, with MANNI in third place followed closely by the EPT. The MANNI family is also very competitive in the number of distances computed. In Figures 13(c) and 14(c), DMANNI and TMANNI are again the fastest, with the EPT catching up. Note how these two indexes are very fast although they compute more distances than the EPT, KVP, and BNC. With very high memory, Figure 13(d), the EPT finally becomes the fastest index, followed by TMANNI, but they are not very different from the sequential search. This can be frustrating; however, costly distance functions can take advantage of these setups, while, as in the case of our benchmark, they become useless for low-cost metrics. Observe in Figure 14(d) that the other indexes compute fewer distances than the MANNI family. Note also that TMANNI computes more distances than the others but is one of the fastest.

Summary. If we rank the results by the time spent, independently of the memory of the index, we find that the fastest index configurations have a small size of 1 to 4 pivots per element of the database. That is true for every size of the dataset.

The fastest indexes are NANNI, TNANNI, and DMANNI, with speedups in the range of 4 to 5 for size 10^5; from 5.63 to 7.23 for size 3 × 10^5; from 9.61 to 10.7 for 10^6; and from 12.9 to 17.11 for size 3 × 10^6. Note how the convenience of using an ANNI index increases with the size of the database. That is not the case for every other index; see for example the EPT or SAT in Figure 13(a), whose speedup is constant for every size of the dataset. The LC is also good and improves as n increases. Notice that this is the worst case for LC, since its construction time becomes overwhelmingly large. Notice how, as n increases, the behavior of some indexes varies significantly. This is the case of VPT, which shows a great speedup (Figure 13(a)), mainly because it has a good tradeoff between simplicity and pruning power. Remarkably, VPT has a lightweight construction, in contrast to LC. Notice that VPT's search cost measured in terms of evaluated distances is not among the best ones (Figure 14(a)). If we count computed distances, EPT is the best, followed by the MANNI family.

4.4.2. Large databases

We now present results for databases of 3 × 10^6, 10^7, 3 × 10^7, and 10^8 items. Figure 15 shows the cost of the queries for the indexes having at most 12 integers per element. The number of distance computations is shown in Figure 15(a); here we see that the EPT is the best, followed very closely by DMANNI, with TMANNI, NANNI, and MANNI also close. All the indexes exhibit a linear growth. For the speedup, shown in Figure 15(b), the results are different. NANNI clearly separates from the others: it is the fastest and exhibits non-linear growth. This is because the NANNI index performs no sequential scan of the objects to discard them. DMANNI is also very good, only surpassed at the end by the VPT. The VPT, which had a rather poor performance in all the other experiments, finally shines because of the combination of its simple structure, the very large size of the database, and a simple distance function.

Summary. This set of experiments describes the performance as a function of the size of the dataset. Note how, consistently, NANNI is the fastest index, with speedups of 23.45 for 10^7, 35.51 for 3 × 10^7, and 63.26 for 10^8 items in the database. It is followed by DMANNI using 2 and 4 groups. The VPT reaches a far second place with speedups of 8.03 for 10^7, 18.67 for 3 × 10^7, and 30.33 for 10^8. The smallest versions of the indexes are the fastest. The case of the EPT is noticeable because it computes the fewest distances but is very slow in comparison to NANNI.

Now, it is interesting to take a closer look at the EPT and the NANNI and MANNI indexes, since we learned from the experiments that they will be the choice for costly distance functions. All of them use partitions of the dataset; the pivots of NANNI take only the closest points, while the pivots of the EPT also choose the farthest ones. Both indexes can be seen as a set of layers: the EPT has pivot groups [23] and the MANNI has ANNIs, but the layers of the EPT all have the same structure, while the MANNI is composed of a NANNI and several ANNIs. The search in the EPT requires each object in the database to be individually checked for discarding; the search in the MANNI can discard whole regions of points. This marks the difference in speedup. Another difference is the simpler implementation of an ANNI-based index compared with the EPT. In more detail, the EPT's construction relies on estimating random variables for both the dataset and the queries, which allows estimating the search cost; an ANNI, in contrast, directly measures the search cost on the training set.
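The contrast between per-object filtering and region discarding can be sketched as follows. The clustering and the covering radii mimic the compact-partition side of NANNI, while the per-object style is the one shown earlier for the pivot table; the code illustrates the two pruning styles and is not the actual NANNI or EPT implementation.

    import random

    def build_regions(db, m, dist=l2):
        # Compact partition: each object joins its closest of m centers, and
        # every region records a covering radius (max distance to its center).
        centers = random.sample(db, m)
        regions = [[] for _ in range(m)]
        cov = [0.0] * m
        for u in db:
            j = min(range(m), key=lambda j: dist(u, centers[j]))
            regions[j].append(u)
            cov[j] = max(cov[j], dist(u, centers[j]))
        return centers, regions, cov

    def region_search(q, centers, regions, cov, radius, dist=l2):
        # Region-style pruning: a whole region is discarded with a single
        # comparison when d(q, c) - cov(c) > radius, i.e., the region's ball
        # cannot intersect the query ball; only surviving regions are scanned.
        out = []
        for c, items, rc in zip(centers, regions, cov):
            if dist(q, c) - rc > radius:
                continue
            out.extend(u for u in items if dist(q, u) <= radius)
        return out

A single distance to a center can thus dismiss an arbitrarily large region, whereas a per-object filter pays at least one table scan per element; this is the structural reason for the speedup gap described above.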

(Figure 13 panels plot speedup against n for the competing indexes. (a) up to 10 integers; (b) 10 to 30 integers; (c) 30 to 100 integers; (d) 100 to 300 integers.)

Figure 13: Search speedup for increasing n, for indexes using up to 300 integers per item. Synthetic datasets (RVEC-12). Higher is better.


(Figure 14 panels plot the number of distance evaluations against n for the competing indexes. (a) up to 10 integers; (b) 10 to 30 integers; (c) 30 to 100 integers; (d) 100 to 300 integers.)

Figure 14: Search cost performance comparison for increasing n (RVEC-12), for indexes using up to 300 integers per item. Lower is better.


(Figure 15 panels plot, against n, the number of evaluated distances and the speedup for LAESA, BNC, DiSAT, SAT, EPT, VPT, DMANNI, TMANNI, NANNI, and MANNI. (a) Number of evaluated distances; lower is better. (b) Speedup; higher is better.)

Figure 15: Search cost performance comparison for increasing n (RVEC-12); the indexes are segmented into four classes of memory usage. Datasets of 3, 10, 30, and 100 million items are used. All indexes use at most 4 instances (12 integers).

4.4.3. Construction and Speedup

From a practitioner's point of view, the essential aspect of an index is its speed: the entire system will be dragged down by a slow index (or by no index at all) or lifted by a speedy one. However, the search life of an index (the expected number of queries over a fixed set of data) must be taken into account. Since the construction cost should be amortized over the search life of the index, in some scenarios a costly construction is not an option; for large database sizes the index might not even be constructible in due time. To put our contribution in perspective with respect to state-of-the-art alternatives of comparable total speed, we experimentally analyze the construction times. The idea of these experiments is to give insight into the relation between construction cost and search performance, and to help potential users decide which index to apply to a given dataset.

We applied a trivial parallelization with a shared-nothing scheme, simply adding one thread per independent part; this applies to SSS, LAESA, EPT, NANNI, TNANNI, MANNI, DMANNI, and TMANNI, whose construction algorithms allow using as many threads as there are pivots per item (a sketch of this scheme is given below). On the other hand, LC, SAT, and KVP were constructed using a single thread because it is not trivial to parallelize them. Finally, the preprocessing time of LC takes into account the number of times the index needs to be constructed to obtain the best parameter combination for a given database.
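A shared-nothing construction of this kind can be sketched as follows: each pivot group is a self-contained table, so the groups can be built concurrently with no synchronization. The sketch assumes the l2 function from the earlier listings; note that in CPython, true parallel speedups for a pure-Python distance would require processes or a native implementation of d, so the thread pool below only illustrates the scheme.

    import random
    from concurrent.futures import ThreadPoolExecutor

    def build_pivot_group(db, seed):
        # One independent pivot group: choose a pivot and precompute its
        # distance to every object. No state is shared with other groups.
        rnd = random.Random(seed)
        pivot = rnd.choice(db)
        return pivot, [l2(u, pivot) for u in db]

    def parallel_build(db, num_groups):
        # One worker per group, mirroring "as many threads as pivots per item".
        with ThreadPoolExecutor(max_workers=num_groups) as pool:
            futures = [pool.submit(build_pivot_group, db, s)
                       for s in range(num_groups)]
            return [f.result() for f in futures]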


index       speedup   construction (s)
LC 64         12.00          30344.85
VPT            2.17             27.70
SSS 0.5        2.96            840.02
BNC 8          3.20             48.56
LAESA 8        2.63              4.51
SAT            3.23            824.78
KVP 4          5.40           1587.49
EPT 8         10.63           2172.20
NANNI          9.70          12436.46
TNANNI         3.90           1014.42
MANNI 2       11.31           3976.15
MANNI 4        8.63           1043.22
MANNI 8        4.96            487.79
DMANNI 2       8.70           2829.77
DMANNI 4      10.81           3459.95
DMANNI 8      11.68           3605.05
TMANNI 2       7.16           2006.01
TMANNI 4       7.90           1721.82
TMANNI 8       8.93           1904.41

Table 1: Construction time and speedup for Wiktionary in a collection of indexes. The three best results are marked.

In other words, if the optimal ratio is determined to be n/m = 64, we need to probe n/m = 1024, 512, 256, 128, 64, and 32. The last step (i.e., n/m = 32) is needed because the search stops when the parameter causes the index to slow down. We did not search inside the interval [33, 63], because this would increase the construction costs significantly.

Table 1 shows the speedup and construction time for nearest neighbor queries in the Wiktionary database. As expected, the construction time of LC is prohibitively large for a big database. The second most expensive construction is that of NANNI; this is expected, since its optimal number of centers is related to the optimal n/m parameter of the LC. It is, however, noticeable that NANNI needs a single instance, and hence it is considerably faster to build than LC. On the other hand, LC has the fastest searches, but it needs a long search life to amortize its construction cost. Other comparable indexes are MANNI, DMANNI, and TMANNI. An interesting result arises with DMANNI 8, because it is constructed many times faster than LC and its searches are almost as fast. Notice that for these three members of the MANNI family we present three instances each, all of them fast; however, automatically selecting the right amount of memory remains an open problem. Notice also that even if we had tried three instances of MANNI (i.e., 2, 4, and 8), we would end up spending one third of the construction cost of LC. Table 2 shows the construction and speedup performances for the RVEC-12-1M dataset. Again, LC has the slowest construction.
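The probing scheme just described amounts to a halving loop over n/m that stops one step after performance degrades. A sketch follows; build(m) and evaluate(index) stand for constructing an LC with m centers and timing it on a query sample, and are placeholders rather than actual API calls.

    def tune_lc_ratio(build, evaluate, n, start_ratio=1024):
        # Probe n/m = 1024, 512, 256, ... and keep the best-performing ratio;
        # the loop stops at the first ratio that makes searches slower again.
        best_cost, best_ratio = float("inf"), None
        ratio = start_ratio
        while ratio >= 1:
            index = build(n // ratio)        # m = n / ratio centers
            cost = evaluate(index)           # e.g., average query time
            if cost >= best_cost:            # slower than the previous try
                break
            best_cost, best_ratio = cost, ratio
            ratio //= 2
        return best_ratio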


index       speedup   construction (s)
LC 64          8.67          12485.90
VPT            2.56             17.28
SSS 0.5        0.61            109.34
BNC 8          0.92              8.91
LAESA 2        0.93              0.34
SAT            1.97            149.97
KVP 2          2.15            216.54
EPT 4          2.94            940.35
NANNI         10.70            870.88
TNANNI         8.21            528.50
MANNI 4        3.23            110.40
MANNI 2        6.67            188.29
DMANNI 4       6.86            494.50
DMANNI 2       9.61            641.96
TMANNI 4       4.65            274.63
TMANNI 2       6.40            301.95

Table 2: Construction time and speedup for RVEC-12-1M. The construction of each index is allowed to use at most as many threads as the index has references per item. The best three results are marked.

In contrast to the Wiktionary experiment, LC is not the fastest at solving queries; that position belongs to NANNI, which also has a small construction time. As in the previous experiment, the DMANNI indexes are faster than MANNI and TMANNI, with a reasonable preprocessing time.

Finally, Table 3 compares construction and speedup performances for larger databases (from 3 to 100 million items). In these cases, we omit LC, SAT, BNC, and SSS. The first one, LC, cannot be properly created for the given databases because its preprocessing time is prohibitively large; the rest (SAT, BNC, and SSS) have little filtering power, and their search speed is comparable to the sequential scan.

                3 × 10^6               10^7                3 × 10^7               10^8
index       speedup  construc.  speedup  construc.   speedup  construc.   speedup  construc.
VPT            5.07      85.10     8.03     999.11     18.67    7291.36     30.32   40400.25
KVP/2          1.87     762.83     2.42    2482.13      2.49   10064.21      2.48   43331.18
EPT/4          3.20    2926.22     3.35   18298.50      3.52   94019.37      3.61  357873.98
MANNI/4        3.48     322.45     5.87    1772.35      4.81    3655.37      8.36   18511.84
DMANNI/4       8.65    1634.77    15.69    7396.48     18.17   21917.46     27.89   96508.12
TMANNI/4       6.68    1054.33     6.87    3416.31      7.68    8514.90     12.32   51430.25
NANNI         17.11    4746.09    23.45   18013.59     35.51   63381.12     63.26  239106.84

Table 3: Construction time and speedup for RVEC-12 over datasets of 3, 10, 30, and 100 million items. In this experiment, all indexes use at most four pivots per item, mainly to maintain a small memory footprint, as is mandatory for large databases. In the case of KVP this means two close and two far pivots. Construction time is in seconds. The best result per column is marked.


The EPT has the largest construction costs; this is because its construction algorithm estimates variables that converge slowly, which can be problematic on large databases. On the other hand, DMANNI is fast, up to 27.89 times faster than the sequential scan. It is followed by TMANNI, yet with a significant speed difference.

5. Conclusions and Perspectives

In this manuscript we introduced the singleton indexes, which allow self-optimization, accepting as parameter the amount of memory available for indexing and using as prior knowledge only the query distribution. This last requirement can be partially fulfilled using a sample of the database, which assumes that the queries are distributed as the dataset. Our approach is oriented towards applications; hence, even if programming our indexes needs some technical skills, the end user requires no additional knowledge to obtain an index consistently faster than the sequential scan.

Our experimental results show how the search cost depends on the complexity model, i.e., the number of distances computed or the elapsed time. In general, optimizing for one criterion does not work for the other. Fortunately, the indexes designed to have low complexity in both measures, like MANNI and NANNI, have excellent performance in both setups; this demonstrates that it is possible to optimize for distances and also obtain a speedup in total query time in most situations. This approach also has the advantage of being more stable than directly optimizing the total elapsed time, which can be affected by many factors in real-world computer systems. As a rule of thumb, we suggest using from two to four pivot groups in a {D,T}MANNI-based index, or NANNI if the construction time can be amortized over a large set of queries. However, determining the right memory setup is an open problem.

In all the experiments, the LC had a small memory footprint and fast searches. It seems an adequate index for most tasks; however, its major drawback is the construction time. Finding the correct parameters for a given metric space implies making several tries, mainly because there is no analysis for parameter selection in the LC. Moreover, each try in parameter selection builds an entire index, and each index construction takes almost quadratic time. If the lifetime of the index is short, the amortized cost could be higher than using a cheaper-to-build index, or no index at all. Our approach is consistently faster to build than the state-of-the-art LC in a sequential environment. Unlike the LC, our index consists of independent parts, so there are no race conditions to consider in a parallel construction. This contrasts with the construction of the LC, where the current center selection depends on all the previous selections, making a parallel construction non-trivial.

There are small differences among our indexes. MANNI assumes independence between routing groups (groups of pivots or centers), while {D,T}MANNI do not; the latter optimize all the groups at once, which essentially makes {D,T}MANNI more robust indexes. The total search time is noisy in a multiuser environment; adding groups of routing objects alleviates this.


However, the MANNI indexes are simpler and have low internal complexity; they converge to similar optima when considering time and distances.

5.1. Comments on Disk-based MANNI Indexes

The focus of this manuscript is main memory, that is, a single-tiered memory scheme. We acknowledge that in modern computers there is a memory hierarchy⁴, where each level is larger and slower than the previous one. In this discussion we assume a two-tier architecture, with main and secondary memory. Our goal is to present some ideas and design principles which would allow the implementation of our indexes for very large databases; notice, however, that we have not implemented them. The implementation of a disk-based extension is beyond the scope of this paper.

⁴ In modern hardware there is a hierarchy of memory: traditional magnetic disk storage, solid-state storage, hybrid storage, distributed storage, etc. In this discussion we focus on traditional magnetic disks, where random accesses are slow operations due to the internal mechanical movements, while sequential accesses are fast.

In the literature, metric indexes in secondary memory follow the two-tiered scheme of main memory and disk. Ciaccia et al. [11] introduced the M-Tree, a dynamic, disk-based metric index. It can be seen as a compact partition index; however, the dynamic operations are the distinguishing property of the M-Tree, whose algorithms make decisions similar to those of the B-Tree or R-Tree [12]. Skopal introduced the PM-Tree [25], an M-Tree enriched with a set of pivots. It significantly improves the search performance of the M-Tree, at the cost of extra complexity in the pivot operations and extra memory to store a pivot table. Another secondary memory index is the M-Index, presented by Novak et al. [19]. The M-Index consists of a compact partition index stored in a B+-tree: the centers of the regions are assigned integers, spaced far enough apart to encode all the items in each region, and each item is encoded as an integer based on its closest center. This strategy resembles iDistance [14], which is designed for multidimensional data.

We consider three cases in our discussion: i) the index and the pivots can be stored in memory; ii) the index can be in memory but the pivots and the dataset must be stored on disk; and iii) everything must be stored in secondary memory.

In the first case, the index and all the pivots are stored in memory, but the database must reside on disk. Since our index has a small memory footprint, this could well be the case in practice. This approach keeps the index simple, since it does not require significant modification. However, there are important conditions that must be kept. Recall that a MANNI is composed of a node-based index and ℓ − 1 ANNI indexes. The NANNI induces a permutation of the dataset, since it stores a list of compact regions, i.e., the sets of items assigned to the same pivot/center. The dataset must be stored in the same order induced by the leader, to reduce the number of random accesses to the disk; a sketch of this layout is given below.
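The following is a minimal sketch of the layout of Figure 16(a) below, assuming fixed-dimension float vectors; the file format and helper names are ours, and this illustrates the region-ordered storage rather than an implementation of the proposed disk-based index.

    import struct

    def write_regions(path, regions, dim):
        # Serialize each region contiguously: items assigned to the same
        # pivot are stored back to back, so fetching one region costs one seek.
        offsets = []
        with open(path, "wb") as f:
            for items in regions:
                offsets.append(f.tell())
                for u in items:
                    f.write(struct.pack(f"{dim}f", *u))
        return offsets       # kept in RAM, next to the in-memory index

    def read_region(path, offsets, sizes, i, dim):
        # One random access (seek) plus one sequential read recovers region i.
        rec = dim * 4        # bytes per vector (float32)
        with open(path, "rb") as f:
            f.seek(offsets[i])
            raw = f.read(sizes[i] * rec)
        return [list(struct.unpack(f"{dim}f", raw[k:k + rec]))
                for k in range(0, len(raw), rec)]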


(Figure 16 diagrams show the region lists items(p_i) → u_{i1}, …, u_{i n_i}, the pivot list p_1, …, p_{ℓm}, and the per-region ANNI columns COL(u). (a) The index and the pivots are stored in RAM and the dataset in secondary memory; each set of items is accessed sequentially. (b) The index is in main memory; the pivots and the dataset are in secondary memory, with the pivots stored as a sequential list. (c) The index, the pivots, and the dataset are stored in secondary memory. Here n_i is the number of items in the region defined by p_i, i.e., n_i = |items(p_i)|; the ℓ − 1 ANNIs are encoded per item as COL(u), i.e., a column of Figure 4.)

Figure 16: Three sketches for the three cases of disk-based MANNIs. On the top left, the storage order: each entry represents the items in the region induced by p_i. On the top right, the order of the pivots on disk: basically, a large list of explicit objects. On the bottom, everything is on disk. For simplicity, each of the NANNI and ANNIs contains m pivots.


One possibility is to proceed in the same way as the M-Index, or with an inverted index structure. Figure 16(a) illustrates how the dataset should be stored; the MANNI follows its normal organization (Figure 4). The idea is that the items in the same region are retrieved with a single random access.

In the second case, the index is the only structure allowed to be in memory. This is similar to the previous case, with the pivots stored on disk. We must ensure that the distances d(q, p), for all pivots p in all parts of the MANNI, can be computed using a single random access. For this, all pivots need to be stored sequentially, no matter whether they belong to the leader or to the ANNIs, and a consistent global pivot identifier must be ensured; see Figure 16(b).

In the last case, none of the structures can be stored in main memory; this is an extreme case. First, as in the second case, we need to store all the pivots sequentially, together with the covering radii, so that in one random access d(q, p) can be evaluated for all pivots and the regions to visit can be selected. As in the second case, all regions are stored contiguously, but we also store the ANNI cells of the items in a region together with the actual items. Figure 16(c) illustrates this arrangement. Despite the cumbersome details, this secondary memory index is just a MANNI taking care of the number of random accesses.

Even when these approaches are promising, it is necessary to notice that they assume the structure is already optimized. This is not necessarily easy; we foresee a non-trivial problem in carrying out the construction in secondary memory, and it remains to be determined whether our optimizing process can be successfully applied to secondary memory indexes. Also, our comments are necessarily general, since many details arise in disk-based indexes. The proper study of a disk-based MANNI is left for future research.

Acknowledgements

We express our gratitude to the anonymous referees for the comments and suggestions which helped to improve this presentation. We also want to thank Nora Reyes and Natalia Miranda for pointing out some inconsistencies in the specification of the algorithms in early stages of this manuscript.

[1] Amato, G., Gennaro, C., Savino, P., 2014. MI-File: using inverted files for scalable approximate similarity search. Multimedia Tools and Applications 71 (3), 1333–1362.

[2] Bustos, B., Navarro, G., Chávez, E., 2003. Pivot selection techniques for proximity searching in metric spaces. Pattern Recognition Letters 24 (14), 2357–2366.

[3] Celik, C., 2002. Priority vantage points structures for similarity queries in metric spaces. In: EurAsia-ICT '02: Proceedings of the First EurAsian Conference on Information and Communication Technology. Springer-Verlag, London, UK, pp. 256–263.


[4] Celik, C., 2002. Priority vantage points structures for similarity queries in metric spaces. In: EurAsia-ICT '02: Proceedings of the First EurAsian Conference on Information and Communication Technology. Springer-Verlag, London, UK, pp. 256–263.

[5] Celik, C., 2008. Effective use of space for pivot-based metric indexing structures. In: SISAP '08: Proceedings of the First International Workshop on Similarity Search and Applications (SISAP 2008). IEEE Computer Society, Washington, DC, USA, pp. 113–120.

[6] Chavez, E., Figueroa, K., Navarro, G., 2008. Effective proximity retrieval by ordering permutations. IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (9), 1647–1658.

[7] Chávez, E., Graff, M., Navarro, G., Téllez, E., 2015. Near neighbor searching with k nearest references. Information Systems 51, 43–61. URL http://www.sciencedirect.com/science/article/pii/S0306437915000241

[8] Chavez, E., Ludueña, V., Reyes, N., Roggero, P., 2014. Faster proximity searching with the Distal SAT. In: Proc. 7th International Conference on Similarity Search and Applications (SISAP). LNCS 8821, pp. 58–69.

[9] Chávez, E., Navarro, G., July 2005. A compact space decomposition for effective metric indexing. Pattern Recognition Letters 26, 1363–1376. URL http://dx.doi.org/10.1016/j.patrec.2004.11.014

[10] Chavez, E., Navarro, G., Baeza-Yates, R., Marroquin, J. L., 2001. Searching in metric spaces. ACM Computing Surveys 33 (3), 273–321.

[11] Ciaccia, P., Patella, M., Zezula, P., 1997. M-tree: An efficient access method for similarity search in metric spaces. In: Proceedings of the 23rd International Conference on Very Large Data Bases. VLDB '97. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp. 426–435. URL http://dl.acm.org/citation.cfm?id=645923.671005

[12] Cormen, T. H., Leiserson, C. E., Rivest, R. L., Stein, C., 2001. Introduction to Algorithms, 2nd Edition. McGraw-Hill, Inc., New York, NY, USA.

[13] Esuli, A., 2012. Use of permutation prefixes for efficient and scalable approximate similarity search. Information Processing & Management 48 (5), 889–902.

[14] Jagadish, H. V., Ooi, B. C., Tan, K.-L., Yu, C., Zhang, R., Jun. 2005. iDistance: An adaptive B+-tree based indexing method for nearest neighbor search. ACM Transactions on Database Systems 30 (2), 364–397. URL http://doi.acm.org/10.1145/1071610.1071612


[15] Malkov, Y., Ponomarenko, A., Logvinov, A., Krylov, V., 2012. Scalable distributed algorithm for approximate nearest neighbor search problem in high dimensional general metric spaces. In: Proc. 5th International Conference on Similarity Search and Applications (SISAP). pp. 132–147.

[16] Malkov, Y., Ponomarenko, A., Logvinov, A., Krylov, V., 2014. Approximate nearest neighbor algorithm based on navigable small world graphs. Information Systems 45, 61–68.

[17] Micó, M. L., Oncina, J., Vidal, E., January 1994. A new version of the nearest-neighbour approximating and eliminating search algorithm (AESA) with linear preprocessing time and memory requirements. Pattern Recognition Letters 15, 9–17. URL http://portal.acm.org/citation.cfm?id=176626.176628

[18] Navarro, G., 2002. Searching in metric spaces by spatial approximation. The Very Large Databases Journal (VLDBJ) 11 (1), 28–46.

[19] Novak, D., Batko, M., Aug 2009. Metric Index: An efficient and scalable solution for similarity search. In: Similarity Search and Applications, 2009. SISAP '09. Second International Workshop on. pp. 65–73.

[20] Mehta, D. P., Sahni, S. (Eds.), 2004. Handbook of Data Structures and Applications. Chapman & Hall / CRC.

[21] Pedreira, O., Brisaboa, N. R., 2007. Spatial selection of sparse pivots for similarity search in metric spaces. In: van Leeuwen, J., Italiano, G., van der Hoek, W., Meinel, C., Sack, H., Plášil, F. (Eds.), SOFSEM 2007: Theory and Practice of Computer Science. Vol. 4362 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, pp. 434–445. URL http://dx.doi.org/10.1007/978-3-540-69507-3_37

[22] Ruiz, G., Chavez, E., Tellez, E. S., 2015. Extreme pivots for faster metric search. Information Systems (under review).

[23] Ruiz, G., Santoyo, F., Chávez, E., Figueroa, K., Tellez, E. S., 2013. Extreme pivots for faster metric indexes. In: Brisaboa, N., Pedreira, O., Zezula, P. (Eds.), Similarity Search and Applications. Vol. 8199 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, pp. 115–126. URL http://dx.doi.org/10.1007/978-3-642-41062-8_12

[24] Samet, H., 2006. Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann.

[25] Skopal, T., 2004. Pivoting M-tree: A metric access method for efficient similarity search. In: DATESO '04. pp. 27–37.


[26] Skopal, T., 2010. Where are you heading, metric access methods?: a provocative survey. In: Proceedings of the Third International Conference on Similarity Search and Applications. SISAP '10. ACM, New York, NY, USA, pp. 13–21. URL http://doi.acm.org/10.1145/1862344.1862347

[27] Tellez, E. S., Chavez, E., Navarro, G., 2013. Succinct nearest neighbor search. Information Systems 38 (7), 1019–1030. URL http://www.sciencedirect.com/science/article/pii/S030643791200097X

[28] Vidal Ruiz, E., July 1986. An algorithm for finding nearest neighbours in (approximately) constant average time. Pattern Recognition Letters 4, 145–157.

[29] Yianilos, P. N., 1993. Data structures and algorithms for nearest neighbor search in general metric spaces. In: Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms. SODA '93. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, pp. 311–321. URL http://dl.acm.org/citation.cfm?id=313559.313789
