A comprehensive analysis of delayed insertions in metric access methods


Humberto Razente , Maria Camila N. Barioni, Regis M. Santos Sousa Universidade Federal de Uberlândia (UFU), Faculdade de Computação, Campus Santa Mônica, Uberlândia, MG, Brazil


Article history: Received 13 August 2019; Received in revised form 30 December 2019; Accepted 7 January 2020; Available online xxxx. Recommended by D. Shasha.

Keywords: Metric access methods; M-tree; Ball-partitioning; Metric spaces; Short-term memory; Forced reinsertion.

✩ This work has been supported by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior – Brasil (CAPES) – Finance Code 001, and by the Brazilian National Council for Scientific and Technological Development (CNPq).
∗ Corresponding author. E-mail addresses: [email protected] (H. Razente), [email protected] (M.C.N. Barioni), [email protected] (R.M. Santos Sousa).

Abstract

Similarity queries are fundamental operations for applications that deal with complex data. This paper presents MIA (Metric Indexing Assisted by auxiliary memory with limited capacity), a new delayed insertion approach that can be employed to create enhanced dynamic metric access methods through short-term memories. We present a comprehensive evaluation of delayed insertion methods for metric access methods while comparing MIA to dynamic forced reinsertions. Our experimental results show that metric access methods can benefit from these strategies, decreasing the node overlap, the number of distance calculations, the number of disk accesses, and the execution time to run k-nearest neighbor queries.

1. Introduction

Nowadays, we interact with several online systems that collect complex data for which it is meaningful to search by similarity [1], such as vector data, multimedia databases, time series, geographic coordinates, and sensor data. In this context, efficient strategies to store, organize, and retrieve these data are desirable. The structures that allow indexing and fast retrieval of complex data by similarity are known as Metric Access Methods (MAM). They are designed to reduce the number of distance calculations and the number of disk accesses when processing similarity query operations. There have been several research works aiming to allow efficient similarity search on large datasets, such as [1–9].

One of the key aspects to consider for the efficient processing of similarity queries in ball-partitioning MAM is the index construction. Trees with a high overlap rate among nodes may reduce the efficiency of similarity query processing. In general, the distribution of data changes over time in a dynamic scenario of data generation [10], which contributes to the growth of the covering radii and to the increase of the overlap among inner nodes.

Different approaches were explored in related works to overcome these issues. Among them, it is important to mention dynamic reinsertions [11] and parallel dynamic batch loading [12], both proposed for the M-tree. These approaches select instances to be removed and reinserted in the tree when a leaf node is about to split; the goal of the latter work is to avoid synchronization problems by postponing reinsertions, which are later performed sequentially. There are also several node split strategies [6,3,2] aiming at decreasing the overlap among nodes. Moreover, there are static solutions, such as bulk loading [13,14] and the slim-down algorithm [3]. The former can be employed to recreate a more compact tree, while the latter can be used to reorganize the leaf pages of an existing tree. The use of global pivots was evaluated to allow better pruning during searches [15,7], as well as local pivots [8]. Although all these related works have been shown to improve MAM performance, none of them considered delayed insertions of entire leaf nodes.

This article presents a new indexing technique based on the hypothesis that delaying the insertion of data elements into a permanent node can lead to the construction of more efficient MAM. This new technique, called MIA (Metric Indexing Assisted by auxiliary memory with limited capacity), employs a short-term memory when processing insertion operations in a MAM, and it does not require storing new information in the tree nodes. The proposed technique can be applied to both metric and multidimensional structures, such as the M-tree [2], the R-tree [16], Slim-trees [3], and PM-trees [7]. Although a preliminary version of the work described herein was presented in a previous paper [17], we explore new aspects in this article.


The present article integrates the concepts introduced in the previous paper and extends them through a more detailed description of the proposed method and the results obtained through a new and exhaustive set of experiments. MIA is comparable to dynamic reinsertions [11,18], as a limited amount of auxiliary memory is used in both methods. Thus, the work presented here evaluates both strategies implemented in the M-tree [2], the landmark ball-partitioning MAM. Experimental results with several real datasets show that MAM built with MIA present a better tradeoff between construction costs and querying costs when compared to forced reinsertions. The main contributions of this article can be summarized as follows:

• MIA technique: a detailed description of a new indexing algorithm that enhances both the MAM construction and the similarity querying operations. MAM built with the MIA technique present lower overlap among the inner nodes and allow faster execution of similarity queries;
• A comprehensive analysis of delayed insertion strategies: the experiments compared MIA and forced reinsertions with the baseline M-tree. The study was guided by four research questions that allowed a detailed comparison of the effort to build the indexes, the index overlap factor, the performance of similarity queries, and the effect of the user-defined parameters on the evaluated methods.

We organized this article as follows. In Section 2, we describe the background. Section 3 details the proposed technique. Section 4 discusses the evaluation and experimental results. Section 5 presents the final considerations.

2. Background

A similarity query retrieves a set of instances from a dataset ordered by their distances to a given query instance. Metric access methods are data structures created to optimize similarity query processing, avoiding the cost of a sequential scan. Nowadays, there are several MAM described in the scientific literature; an extensive review can be found in [19]. A landmark among them is the M-tree [2], a dynamic, disk-based, balanced ball-partitioning hierarchy, built in a bottom-up fashion, such as the B+-tree. Several works were proposed to enhance the M-tree performance. They introduce new indexing algorithms, such as the use of minimum spanning trees to split nodes [3], the relaxation of the height-balancing of dense regions [20], forced reinsertions and leaf selection strategies [11,18], and the use of global pivots to define cut-regions [7].

The properties of metric spaces are the basis for the optimization of MAM [19]. A metric space is a pair ⟨𝕊, δ()⟩, where 𝕊 is a data domain and δ() is a distance function that satisfies the following axioms for any elements x, y, z in 𝕊: δ(x, x) = 0 (identity); δ(x, y) = δ(y, x) (symmetry); 0 ≤ δ(x, y) < ∞ (non-negativity); and δ(x, y) ≤ δ(x, z) + δ(z, y) (triangle inequality). Among these properties, the triangle inequality is particularly essential to discard branches of the hierarchy that cannot contain answers during the MAM traversal that solves a search operation.

Briefly, ball-partitioning hierarchies are composed of inner nodes and leaf nodes. An inner node contains a set of pairs of the form ⟨pivot, radius⟩, where pivot is a data element and radius is the subtree covering radius (all elements in the subtree are at most the distance radius from pivot). Each one of these pairs defines a ball that covers all the data elements in the tree branch it represents. The triangle inequality is essential to prune the search space by discarding branches of the index hierarchy, i.e., it determines whether two balls intersect.
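For illustration, the following minimal sketch (in Python, with illustrative names; it is not part of the authors' C++ implementation) shows how the triangle inequality is applied to decide whether a query ball can intersect the ball of a routing entry, and hence whether the corresponding branch must be visited:

```python
import math

def euclidean(x, y):
    # L2 distance; any function satisfying the metric axioms could be used
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def ball_intersects(query, query_radius, pivot, covering_radius, dist=euclidean):
    # By the triangle inequality, the two balls can only share elements when the
    # distance between their centers does not exceed the sum of their radii;
    # otherwise the whole branch rooted at 'pivot' can be safely pruned.
    return dist(query, pivot) <= query_radius + covering_radius

# A branch whose pivot lies far from the query region is discarded without visiting it.
print(ball_intersects((0.0, 0.0), 1.0, (5.0, 0.0), 2.0))  # False -> prune branch
print(ball_intersects((0.0, 0.0), 1.0, (2.5, 0.0), 2.0))  # True  -> visit branch
```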

The leaf nodes contain the data elements. The tree is created with an empty leaf node, which is also the root. An insertion always occurs in a leaf node. When a full leaf receives a new element, an overflow occurs: a new node is created, a split algorithm is employed to distribute the elements between the two nodes, and the upper levels are updated recursively. This mechanism, known as bottom-up construction, guarantees that the structure is always balanced. The work [2] proposes the use of mM_RAD, a split algorithm that finds a pair of representatives that splits the node while minimizing both covering radii; its time complexity is O(n²) distance calculations, where n is the number of elements of a node. The work [3] proposes the MST split, which builds a minimum spanning tree over the node elements and removes its longest edge, resulting in two clusters of elements that become nodes; its time complexity is O(n · log n) on the number of elements of a node.

The insertion process can result in a significant overlap among nodes, at both the leaf and the inner levels. The index traversal may not be able to prune overlapped nodes, and therefore they decrease the MAM efficiency to run similarity queries. Among the strategies to dynamically fine-tune metric access methods, the forced reinsertion strategy plays an important role, as presented in the following section.

2.1. Forced reinsertions

The dynamic nature of metric access methods allows indexing large databases by similarity while supporting arbitrary insertions and deletions. However, the performance can deteriorate over time. The deterioration can be higher in spaces with a high (intrinsic or explicit) dimensionality, due to the ''curse of dimensionality'' [21]. To deal with this issue, forced reinsertions [11,18] allow better node occupancy and consequently result in more compact nodes. When the insertion of an instance causes an overflow in a leaf node, the reinsertion of a few elements from this leaf allows decreasing its radius. However, these reinsertions may increase the radii of the leaves that receive the elements. It is an opportunity to move outliers to more suitable leaves. Thus, the farthest elements from the representative routing entry are removed from the leaf and are kept in an auxiliary main-memory stack. If the radius of the leaf node decreases, the radii of the parent entries can decrease, up to the root. Then, the stack entries are reinserted until the stack is empty. As reinsertions can also cause further reinsertion attempts, a user-defined recursion depth parameter limits the chain of subsequent reinsertions. The parameter kFR defines the number of entries to be removed from a leaf for reinsertion; the instances are removed in descending order of distance from the routing instance. The experiments presented in [11] allowed to empirically define kFR = 5 and recursion depth = 10 as default values (the best tradeoff between construction and query costs). Four strategies were defined to remove the entries (a sketch of the overall reinsertion step is given after the list below):

• Pessimistic: it supposes the new instance will be reinserted in the same leaf that was about to split; thus, removal stops once the new element is processed, or the leaf is allowed to split if the new instance is the farthest in the node;
• Optimistic: after inserting the new instance in the leaf node, it removes kFR instances for reinsertion and updates the node radius;
• Rev_pessimistic and Rev_optimistic: similar to the former strategies, except that the instances from the stack are reinserted in ascending order of distance.
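As a hedged illustration of the mechanism described above (not the authors' implementation), the sketch below shows the Optimistic removal step on a single leaf: the new element is inserted, the kFR entries farthest from the representative are pushed to a stack, the leaf radius shrinks, and the removed entries are reinserted through the normal insertion path while a recursion-depth budget bounds the chain of reinsertions. The callables `reinsert` and `split` stand for the tree's regular routines and are assumptions of this sketch.

```python
def optimistic_reinsertion(leaf, rep, new_elem, k_fr, dist, reinsert, split,
                           depth=0, max_depth=10):
    if depth >= max_depth:                 # recursion-depth budget exhausted: split instead
        return split(leaf, new_elem)
    leaf.append(new_elem)                  # optimistic: insert first
    leaf.sort(key=lambda e: dist(rep, e))  # ascending distance to the representative
    stack = [leaf.pop() for _ in range(min(k_fr, len(leaf) - 1))]  # farthest entries
    new_radius = dist(rep, leaf[-1])       # the leaf radius may shrink
    while stack:                           # reinsert through the normal insert path
        reinsert(stack.pop(), depth + 1)
    return new_radius

# Toy usage with scalar "elements" and the absolute difference as the metric:
pending = []
r = optimistic_reinsertion([1.0, 2.0, 9.0], rep=0.0, new_elem=3.0, k_fr=2,
                           dist=lambda a, b: abs(a - b),
                           reinsert=lambda e, d: pending.append(e), split=None)
print(r, pending)  # shrunken radius 2.0 and the entries queued for reinsertion
```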


The standard M-tree insertion process starts by finding a path from the root to a leaf. In many cases, more than one path can be chosen if the inserted element is covered by the radii of more than one branch representative. The Single-way leaf selection works as follows: at each level, if there is a branch of the tree that covers the new instance, it must be followed; if multiple branches qualify, the closest representative (mindist) or the one with the minimum occupancy (minoccup) is chosen [3]. In the upper levels of the tree, the high overlap can disturb the selection of a path, possibly increasing a leaf radius. To deal with this issue, [22] proposes to find the optimal path: the Multi-way leaf selection employs a point query to find the candidate leaves that cover the new instance and selects the closest leaf node that can accommodate the instance without splitting. Despite the higher construction cost, the method increases the performance of similarity queries. To deal with this construction cost, the Hybrid-way leaf selection [18] selects a limited set of candidates at each level, reducing the complexity of the Multi-way leaf selection while still possibly finding the optimal path.
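A minimal sketch of the Single-way decision at one tree level is given below (Python, with an illustrative (pivot, radius) entry layout that is an assumption of the sketch, not the M-tree page format). It follows the mindist variant: among covering branches the closest pivot wins; when no branch covers the element, the branch requiring the smallest radius increase is followed.

```python
def single_way(entries, element, dist):
    # entries: list of (pivot, radius) routing pairs of one inner node
    covering = [e for e in entries if dist(e[0], element) <= e[1]]
    if covering:
        # mindist tie-break: follow the covering branch with the closest pivot
        return min(covering, key=lambda e: dist(e[0], element))
    # no branch covers the element: follow the one whose radius must grow the least
    return min(entries, key=lambda e: dist(e[0], element) - e[1])

# Example in one dimension with the absolute difference as the metric:
entries = [(0.0, 2.0), (10.0, 3.0)]
print(single_way(entries, 1.5, lambda a, b: abs(a - b)))   # (0.0, 2.0): covered, closest
print(single_way(entries, 6.0, lambda a, b: abs(a - b)))   # (10.0, 3.0): smallest growth
```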

2.2. Measuring the overlap

The MAM evaluation methodology is based on the tradeoff between the time spent to build an index and the time spent to run a query. A valuable tool that allows us to analyze the reasons why a building strategy results in better indexes is the computation of overlap measures. Since it is not possible to compute the volume of the intersections in generic metric spaces, we computed the relative fat-factor [3], a measurement based on counting the elements in the intersections of overlapped leaf nodes. Let node1 and node2 be two index entries. The overlap of node1 and node2 is defined as the number of elements in the corresponding subtrees that are covered by both regions, divided by the number of elements in both subtrees. For the comparison of different trees storing the same data in structures with a different number of nodes, a measurement based on a normalized equation over the minimum theoretical tree was proposed: the relative fat-factor. The relative fat-factor of an index T takes into consideration the minimum theoretical number of nodes (Mmin), the minimum theoretical height (Hmin), the height H and the number of nodes M of the tree, the total number of node accesses needed to answer one point query for each stored element (Ic), and the total number of elements N:

fat_rel(T) = (Ic − Hmin · N) / (N · (Mmin − Hmin)).

The relative fat-factor varies from zero to a positive number. It allows the comparison of two trees considering both the number of overlapped elements and the efficient occupation of the nodes; in general, the smaller the relative fat-factor, the fewer disk accesses are needed to perform queries.
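A small sketch of this measure (an illustrative computation with hypothetical numbers, not taken from the experiments) is shown below:

```python
def relative_fat_factor(ic, h_min, m_min, n):
    # fat_rel(T) = (Ic - Hmin*N) / (N * (Mmin - Hmin)), where Ic is the total number of
    # node accesses needed to answer one point query for each of the N stored elements
    return (ic - h_min * n) / (n * (m_min - h_min))

# Illustrative values: 20,000 elements, ideal tree with 3 levels and 400 nodes,
# and 2,442,000 node accesses summed over the 20,000 point queries.
print(relative_fat_factor(ic=2_442_000, h_min=3, m_min=400, n=20_000))  # 0.3
```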

3. Metric indexing assisted by short-term memories

This article presents a strategy in which an auxiliary memory of limited storage capacity is used to delay insertions when building a MAM. It can be employed in the construction of dynamic MAM such as the M-tree and its descendants. In this work, we chose the M-tree as the baseline. The M-tree [2] is a MAM where each tree node is stored in a fixed-size disk page, with bottom-up incremental construction. Let S ⊆ 𝕊 be a dataset and δ() be a distance function. An inner node contains a set of entries of the form ⟨Srepi, pageID, radius, δ(Srepi, Srep)⟩, where Srepi is a data element that acts as the representative (pivot) of the subtree rooted at pageID, pageID allows the retrieval of the subtree, radius is the branch covering radius, and δ(Srepi, Srep) is the distance to the node representative Srep (stored in the upper level). Each entry defines a ball (centered at Srepi with radius radius) that covers all the data elements in the tree branch it represents. A leaf node contains a set of entries of the form ⟨id, Si, δ(Si, Srep)⟩, where id is a key, Si is a data element, and δ(Si, Srep) is the distance to the node representative Srep (stored in the upper level). The leaf nodes contain all data elements.

Our strategy is based on the assumption that, by delaying the insertion of an element into a permanent node (disk page), it may be grouped with other, more similar elements that will be inserted later, thus contributing to minimizing the overlap among nodes. Therefore, the insert operation is performed in the following manner. If the insertion in the leaf node would increase its covering radius, the new algorithm inserts the element into the short-term memory, delaying the persistence of the element in the tree, which would cause an immediate overlap increase. The short-term memory is a main memory-based data structure whose size is predefined by the user. A redo log file is employed to persist the inserted data elements temporarily and enable fault tolerance. It is important to note that these elements are added to the tree hierarchy by later processing.

3.1. Insert

The insert algorithm works as follows. Let si ∈ S be a new data element. Starting from the root node, if there is a node (branch) that covers si, the algorithm selects that node; otherwise, it chooses the node whose distance from the representative is smallest. If more than one node covers the new element si, the algorithm employs the Single-way leaf selection heuristic. This recursive search is performed until it reaches a leaf node. Fig. 1 outlines the complete process. After choosing the leaf node to insert the element si, if the radius of the node does not need to increase and there is space available, the algorithm adds si to the selected leaf node. If the insertion causes an overflow, it runs a split algorithm (such as the heuristics m_RAD, mM_RAD, M_LB_DIST, RANDOM, or SAMPLING [2]), and the new representatives may be updated in the upper levels, recursively (Fig. 1a). On the other hand, if the insertion of the new element si would increase the leaf node radius, instead of inserting it in the leaf node, it is sent to the short-term memory until the memory reaches its capacity (Fig. 1b). When there is an overflow in the short-term memory, a new leaf node (or a set of leaf nodes) is created with clusters of elements removed from the short-term memory. The algorithm creates groups of elements, where each group is limited by the node capacity c (the size of a leaf node minus metadata space) and the occupation rate t. The occupation rate defines the space left free in the leaf node for future insertions, aiming to avoid a split or the creation of a new leaf node that overlaps the current one. The strategy enables a new element that is not covered by a particular leaf node to wait in the short-term memory until it can be grouped with other neighboring elements, thus forming a better new leaf. The insert operation of the standard M-tree may lead to a leaf node radius increase (Figs. 2a and 2b). In this case, the query performance of the structure decreases as the overlap increases, because the search will have to traverse more branches. The MIA algorithm, shown in Fig. 2c, allows the elements that would have caused the increase in the radius to form a new node. Therefore, the technique enables creating compact nodes at low processing and memory costs.
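The decision taken for each arriving element can be summarized by the hedged sketch below (Python; the leaf layout, the callbacks, and their names are assumptions of this illustration rather than the authors' C++ structures): a covered element follows the usual M-tree path, an element that would enlarge the leaf radius is buffered, and a full short-term memory triggers the creation of a new leaf from one of its clusters.

```python
from dataclasses import dataclass, field

@dataclass
class Leaf:
    rep: tuple                 # representative (pivot) of the leaf
    radius: float              # covering radius
    elems: list = field(default_factory=list)

def mia_insert(element, leaf, stm, stm_capacity, node_capacity,
               dist, split, build_leaf_from_stm):
    if dist(leaf.rep, element) <= leaf.radius:   # covered: behave like the standard M-tree
        leaf.elems.append(element)
        if len(leaf.elems) > node_capacity:
            split(leaf)                          # mM_RAD or any other split policy
    else:
        stm.append(element)                      # delay: the leaf radius would have to grow
        if len(stm) > stm_capacity:
            build_leaf_from_stm(stm)             # SM-Random, SM-Density, or SM-Cluster
```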


Fig. 1. Overview of the insertion algorithm assisted by the short-term memory.

Fig. 2. Leaf node behavior after inserting an element. (a) Standard M-tree leaf nodes and the new element. (b) The increase of the node radius r to include the new element. (c) New inner node created by processing the short-term memory after the insertion of new elements.

To insert a new leaf created with a set of elements removed from the short-term memory, we need to find a suitable subtree, starting from the root node and descending to the level above the leaves. The algorithm works as follows. If there is a node (subtree) that covers the new leaf, the algorithm selects it; otherwise, it chooses the node whose distance to the new leaf is smallest. If more than one node covers the new leaf, we propose a Single-way leaf selection variant based on the minimum distance that receives a whole leaf node instead of a single element: it chooses the subtree whose radius increase will be the smallest. After inserting the new leaf, the insert algorithm promotes the leaf representative element. As in the standard M-tree, if the promotion causes an overflow in the inner node, it runs the split algorithm, promoting the new representatives recursively.
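The minimum radius-increase criterion used to place a whole leaf can be illustrated by the following sketch (Python; the (pivot, radius) entry layout and the helper name are assumptions of the example):

```python
import math

def choose_subtree_for_leaf(entries, leaf_rep, leaf_radius, dist):
    # pick the routing entry whose covering radius grows the least when it
    # must absorb the whole new leaf (pivot of the leaf plus its radius)
    def radius_increase(entry):
        pivot, radius = entry
        needed = dist(pivot, leaf_rep) + leaf_radius
        return max(0.0, needed - radius)
    return min(entries, key=radius_increase)

# Toy usage in the plane with the Euclidean distance:
entries = [((0.0, 0.0), 3.0), ((10.0, 0.0), 6.0)]
print(choose_subtree_for_leaf(entries, leaf_rep=(8.0, 0.0), leaf_radius=1.0,
                              dist=math.dist))   # ((10.0, 0.0), 6.0): no growth needed
```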

3.2. Selecting elements from the short-term memory

We defined three strategies to extract clusters of elements from the short-term memory. It is important to note that the requirement of a low computational cost drove the development of these strategies. The goal is to select a representative element (or a set of representative elements) from the short-term memory that minimizes the sum of distances to its closest elements (limited by the node capacity and the node occupation rate). The challenge is to define how to create a leaf node considering the tradeoff between the computational cost and the ability to create compact clusters.

3.2.1. SM-Random

The SM-Random algorithm randomly selects a short-term memory element as the node representative. The distances from the chosen element to all the elements in the short-term memory are computed and ranked. Fig. 3 presents the intuition of the algorithm: a random element is selected as the representative, and its nearest neighbors are selected until the node capacity is reached. In the example illustrated in Fig. 3, a representative from a not very dense neighborhood was selected. Algorithm 1 selects the most similar elements up to the node capacity and the occupation rate. Its time complexity is O(n) distance calculations, where n is the number of elements in the short-term memory. A sampling strategy can be employed, i.e., this selection can be repeated a constant number of times, returning the densest among the evaluated sets.


Fig. 3. SM-Random: randomly select an element and add its nearest neighbors.

Algorithm 1: Short-term memory random leaf node.
Input: the elements m in the short-term memory M (m ∈ M), the node capacity c, the occupation rate t.
Output: leaf node node.
1. Randomly choose an element m ∈ M to be rep
2. for each mi ∈ M do
3.     d[i] ← δ(rep, mi)
4. end
5. Sort vectors d and M according to d in descending order
6. Select the elements from M in backward order until the leaf page is filled according to c and t: node ← select(M, c, t)
7. Remove from M the selected elements
8. Return node

3.2.2. SM-Density

The SM-Density algorithm finds the set of elements, up to the node capacity and the occupation rate, that minimizes the sum of distances to the representative element. Fig. 4 presents the intuition of the algorithm. It allows finding the densest leaf from a set of elements that were not included in other leaves because they were not covered by them. Algorithm 2 details the strategy. Its time complexity is O(n²) distance calculations, where n is the number of elements in the short-term memory. The densest set found is selected to create the new leaf node.

Algorithm 2: Short-term memory densest leaf node.
Input: the elements m in the short-term memory M (m ∈ M), the node capacity c, the occupation rate t.
Output: leaf node densestnode.
1. set radius = ∞
2. for each mi ∈ M do
3.     rep = mi
4.     for each mj ∈ M do
5.         d[j] ← δ(rep, mj)
6.     end
7.     Sort vectors d and M according to d in descending order
8.     Select the elements from M in backward order until the leaf page is filled according to c and t: node ← select(M, c, t)
9.     if radius > node.radius then
10.        set node as densestnode
11.        update radius = node.radius
12.    end
13. end
14. Remove from M the elements of the densest node
15. Return densestnode

Fig. 4. SM-Density: find the element that minimizes the sum of distances to its nearest neighbors.

3.2.3. SM-Cluster

The SM-Cluster algorithm employs a k-medoids clustering algorithm such as CLARANS (Clustering Large Applications based on Randomized Search) [23]. The algorithm sets each medoid found as a leaf node representative and groups the closest elements of each representative, up to the node capacity and the occupation rate. To define kc, the number of clusters, we employed the ratio between the size of the short-term memory and the leaf node capacity, limited to a constant value, as the algorithm time complexity grows exponentially with the number of clusters. The CLARANS time complexity is O(n²) distance calculations, where n is the number of elements in the short-term memory. Algorithm 3 describes the steps required by this strategy. Fig. 5 presents the intuition of the algorithm: in this example, three clusters were defined by their medoids, and their nearest neighbors were selected to fill up the nodes.

Algorithm 3: Short-term memory cluster leaf nodes.
Input: the elements m in the short-term memory M (m ∈ M), the node capacity c, the occupation rate t, the number of clusters kc.
Output: set of leaf nodes nodes.
1. rep[] = clarans(M, kc)
2. for each repi ∈ rep[] do
3.     for each mj ∈ M do
4.         d[j] ← δ(repi, mj)
5.     end
6.     Sort vectors d and M according to d in descending order
7.     Select the elements from M in backward order until the leaf page is filled according to c and t: nodes[i] ← select(M, c, t)
8.     Remove from M the elements inserted in nodes[i]
9. end
10. Return nodes

Fig. 5. SM-Cluster: select k-medoids, set each medoid as a node center and add its nearest neighbors up to the node capacity.
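For concreteness, the two cheaper selection strategies can be sketched as follows (Python; an illustrative sketch under the assumption that the occupation rate gives the fraction of the node capacity to be filled, and following Algorithm 2 in measuring density by the covering radius; it is not the Arboretum implementation):

```python
import random

def sm_random(stm, capacity, occupation, dist):
    # SM-Random: a random representative plus its nearest neighbors from the
    # short-term memory, up to capacity * occupation elements (O(n) distances)
    rep = random.choice(stm)
    limit = max(1, int(capacity * occupation))
    chosen = sorted(stm, key=lambda e: dist(rep, e))[:limit]
    for e in chosen:
        stm.remove(e)
    return rep, chosen

def sm_density(stm, capacity, occupation, dist):
    # SM-Density: try every element as the representative and keep the candidate
    # leaf with the smallest covering radius (O(n^2) distances)
    limit = max(1, int(capacity * occupation))
    best = None
    for rep in stm:
        group = sorted(stm, key=lambda e: dist(rep, e))[:limit]
        radius = dist(rep, group[-1])
        if best is None or radius < best[0]:
            best = (radius, rep, group)
    _, rep, group = best
    for e in group:
        stm.remove(e)
    return rep, group
```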

4. Experiments

The access methods employed in the experiments were implemented in C++ [24]. We ran the experiments on a 64-bit Linux personal computer with 8 GB of main memory, an Intel Core i7−[email protected] GHz processor, and a 1 TB hard disk. A single CPU core was employed. The following configuration parameters were active in the experiments (listed after Tables 1 and 2):


Table 1. Datasets.

Dataset        Number of elements   Number of dimensions   Short-term mem (KB)   Disk page size (KB)   Ref.
Pendigits      10,992               16                     32                    2                     [25]
Eigenfaces     11,900               16                     32                    2                     [25]
Letter         20,000               16                     32                    2                     [25]
SISAP Nasa     40,150               20                     40                    4                     [26]
Corel          68,040               32                     64                    4                     [25]
Covertype      581,102              54                     108                   8                     [25]
SISAP Colors   112,682              112                    224                   16                    [26]

Table 2. Time in seconds to build the indexes.

Dataset        M-tree    SMR      SMD       SMC       M-FR
Pendigits      0.658     0.489    6.108     2.824     0.775
Eigenfaces     0.728     0.538    6.486     2.573     0.846
Letter         1.158     0.977    9.186     4.297     1.380
SISAP Nasa     6.536     5.683    14.449    9.415     7.006
Corel          7.274     5.726    46.042    27.417    8.119
Covertype      150.681   86.215   667.564   412.165   157.213
SISAP Colors   48.324    43.625   109.857   82.971    51.328

• Distance function: Euclidean (L2);
• Short-term memory: 500 elements;
• Forced reinsertions: optimistic, kFR = 5 and recursion depth = 10;
• Node split policy: mM_RAD;
• Similarity queries: k-nearest neighbors, random selection of 100 query elements;

• SM-Density strategy: optimal (densest) set;
• SM-Cluster strategy: CLARANS parameters kc = max(n/c, 5), maxNeighbor = max(50, 0.0125 · (kc · (n − kc))), and numLocal = 2, where kc is the number of clusters, c is the node capacity (number of elements), and n is the size of the short-term memory.

Table 1 presents the datasets evaluated in the experiments. The reasons for the selection of these datasets are twofold: they are among the datasets often used for the evaluation of MAM in related works, and they present different numbers of elements and attributes, which contributes to the evaluation of the compared methods in different scenarios. The fourth column of this table shows the size allocated in KB for the short-term memory (the size depends on the number of dimensions and was adjusted to hold 500 data elements), and the fifth column presents the size of the disk pages employed to build the indexes (one tree node per disk page). In our framework, all distances are computed and stored as double-precision floating-point numbers (8 bytes) and vectorial data as single-precision floating-point numbers (4 bytes). These configurations allowed the creation of trees with 4 levels for Pendigits, Eigenfaces, Letter, Nasa, Corel, and Colors. Covertype trees resulted in 4 levels for the M-tree and for the M-tree with forced reinsertions (M-FR), and 5 levels for SMR, SMD, and SMC. The experiments resulted in the following information concerning the use of the short-term memory and forced reinsertion strategies: the time spent to build the indexes, the index overlap rate, and the performance of the indexes to run queries. The four questions that guided our analysis are:

• Q1: Is it possible to create indexes in shorter processing time by using short-term memory or forced reinsertion strategies?
• Q2: Is it possible to build indexes with lower overlap among nodes by using short-term memory or forced reinsertion strategies?
• Q3: Is it possible to run similarity queries faster on indexes built with the aid of short-term memory or forced reinsertion strategies?
• Q4: How do the size of the short-term memory, the number of reinserted elements, and the size of the reinsertion stack affect the performance of the methods?

In the next sections, the M-trees built with forced reinsertions are referred to as M-FR, and the M-trees built with the short-term memory based on SM-Random, SM-Density, and SM-Cluster are referred to as SMR, SMD, and SMC, respectively. In all the following experiments, the smallest values indicate the best results.

Table 3. Number of distance calculations to build the resulting indexes (×10^6).

Dataset        M-tree   SMR     SMD      SMC      M-FR
Pendigits      8.0      5.6     71.3     52.9     8.4
Eigenfaces     8.8      6.1     65.5     47.7     9.1
Letter         13.7     11.1    104.5    79.1     14.9
SISAP Nasa     74.1     64.0    151.5    123.5    77.4
Corel          59.5     45.4    364.9    287.4    63.2
Covertype      883.9    449.7   3314.4   2683.0   886.7
SISAP Colors   151.6    134.3   332.5    276.2    158.4

4.1. Evaluation of the index construction

To answer Q1, ''Is it possible to create indexes in shorter processing time by using short-term memory or forced reinsertion strategies?'', Table 2 shows the time in seconds to build the indexes. The use of a short-term memory decreased the time spent on indexing the data for SMR and increased it for SMD and SMC when compared to the standard M-tree, as expected. Although forced reinsertions (M-FR) increased the time, both SMC and M-FR were built in the same order of magnitude as the standard M-tree. Table 3 presents the number of distance calculations. There is a linear correlation between build time (Table 2) and build distance calculations (Table 3) for each dataset, and the ratio between them increases as the dimensionality and the number of elements increase. SMR reduces the time spent to build the indexes because it avoids most leaf splits: for the creation of a new leaf, the time complexity of SMR is O(n) distance calculations (n being the size of the short-term memory), while for the standard M-tree the MINMAX split computes O(n²) distance calculations (n being the number of elements per node). It is interesting to note that the SMC time is related to the CLARANS parameters, which were set to the default values suggested by [23].

Table 4 presents the number of disk accesses to build the resulting indexes (the number of page reads and writes). For all the evaluated methods, as the insert algorithm is recursive, finding the path to a leaf reads one page per node of the hierarchy. All reads and writes are counted, regardless of whether the pages are stored in some sort of cache. We did not create a cache structure, although all the indexes may benefit from the system or hardware cache (for instance, the hard drive employed to store the experiments had 16 MB of cache). When returning from the recursion, parent nodes are updated, at least with the metadata information of the number of elements in the branch, a radius update, and eventually a new routing entry due to a node split. When compared to the standard M-tree, the methods based on the short-term memory presented approximately the same number of disk accesses, while a considerable increase can be noticed for the forced reinsertion method.

4.2. Analysis of the tree overlap

To answer Q2, ''Is it possible to build indexes with lower overlap among nodes by using short-term memory or forced reinsertion strategies?'', we computed the fat-factors as presented in Table 5.


Table 4. Number of disk accesses to build the resulting indexes (×10^3).

Dataset        M-tree   SMR      SMD      SMC      M-FR
Pendigits      85.9     85.6     85.2     85.9     164.2
Eigenfaces     94.8     94.2     94.0     94.2     186.2
Letter         166.6    163.6    164.4    163.6    293.1
SISAP Nasa     288.8    299.7    295.7    296.9    439.2
Corel          589.6    575.3    573.2    574.2    970.1
Covertype      5199.3   5469.8   5492.9   5493.9   9286.6
SISAP Colors   950.6    951.1    948.8    952.6    1403.9

Table 5. Relative fat-factor.

Dataset        M-tree   SMR     SMD     SMC     M-FR
Pendigits      0.347    0.205   0.219   0.209   0.289
Eigenfaces     0.328    0.143   0.169   0.147   0.271
Letter         0.581    0.298   0.295   0.284   0.482
SISAP Nasa     0.371    0.223   0.212   0.224   0.259
Corel          0.542    0.240   0.235   0.232   0.474
Covertype      0.138    0.105   0.110   0.108   0.104
SISAP Colors   0.489    0.394   0.387   0.364   0.485

Table 6. Number of nodes in the resulting indexes.

Dataset        M-tree   SMR      SMD      SMC      M-FR
Pendigits      581      662      656      667      595
Eigenfaces     642      694      712      688      639
Letter         990      1,188    1,166    1,172    1,048
SISAP Nasa     1,154    1,346    1,339    1,324    1,194
Corel          2,951    3,423    3,439    3,434    3,061
Covertype      21,796   22,835   22,980   22,528   21,211
SISAP Colors   3,732    3,950    4,027    3,983    3,846

Table 7. Total time in seconds to run 100 executions of 100-nearest neighbors' queries.

Dataset        M-tree   SMR     SMD     SMC     M-FR
Pendigits      0.067    0.051   0.048   0.050   0.061
Eigenfaces     0.053    0.044   0.048   0.045   0.049
Letter         0.179    0.151   0.153   0.152   0.177
SISAP Nasa     0.237    0.217   0.211   0.216   0.213
Corel          0.637    0.453   0.440   0.448   0.622
Covertype      1.753    1.095   0.998   1.015   1.249
SISAP Colors   2.370    2.286   2.312   2.240   2.434

Table 8. Average number of disk accesses of 100-nearest neighbors' queries.

Dataset        M-tree   SMR    SMD    SMC    M-FR
Pendigits      344      236    216    229    309
Eigenfaces     263      191    212    195    244
Letter         871      779    762    772    874
SISAP Nasa     772      746    711    748    707
Corel          2177     1630   1563   1595   2147
Covertype      4999     3144   3028   3063   3316
SISAP Colors   2585     2510   2547   2481   2686

We also show the resulting number of nodes in Table 6. From Table 5, it is possible to notice that the short-term memory and the forced reinsertion strategies increase the quality of the indexes for all the datasets employed in the experiments. Moreover, the methods based on the short-term memory resulted in smaller fat-factors when compared with M-FR, except for the Covertype dataset. For instance, considering the SISAP Nasa dataset, the relative fat-factor was reduced by up to 42.8% for SMR, SMD, and SMC, while it was reduced by 30.2% for M-FR. Considering the SISAP Colors dataset, the relative fat-factor was reduced by up to 25.5% for SMR, SMD, and SMC when compared to the standard M-tree, while it was reduced by 1% for M-FR.

Table 6 shows the resulting number of nodes of the indexes. For instance, analyzing the results for the Covertype dataset, the number of nodes is 4.8% and 5.4% greater for SMR and SMD, respectively, when compared with the standard M-tree. Although the use of the algorithms SMR, SMD, and SMC resulted in trees with more nodes, the nodes are tighter and the elements are better distributed, as they resulted in trees with smaller relative fat-factors. Forced reinsertion resulted in trees with a slight increase in the number of nodes when compared with the standard M-tree, except for Covertype, where a reduction of 2.7% can be observed. Although forced reinsertion results in a smaller increase in the number of nodes when compared to MIA, it does not decrease the overlap as much as MIA.

4.3. Performance of similarity queries

To answer Q3, ''Is it possible to run similarity queries faster on indexes built with the aid of short-term memory or forced reinsertion strategies?'', we ran 100 k-nearest neighbor queries with k = 100. It is important to notice that, since the short-term memory may still hold data elements, the k-nearest neighbor algorithm described in [27] was adapted to search the short-term memory first.

Table 9. Average number of distance calculations of 100-nearest neighbors' queries (×10^3).

Dataset        M-tree   SMR    SMD    SMC    M-FR
Pendigits      4.5      3.2    3.1    3.1    4.0
Eigenfaces     3.4      2.9    3.2    3.0    3.2
Letter         15.0     11.0   11.4   11.2   14.1
SISAP Nasa     21.6     18.1   17.7   18.1   19.2
Corel          37.4     24.7   24.3   23.9   36.0
Covertype      59.1     35.6   30.9   31.6   44.9
SISAP Colors   60.0     57.1   57.0   55.9   61.4

Table 7 presents the query processing times obtained in the experiments. Both the short-term memory and the forced reinsertion strategies reduced the time spent to process k-nearest neighbor queries. For instance, SMD reduced the time for Covertype by 43%, while M-FR reduced it by 29%. Table 8 shows the average number of disk accesses, and Table 9 shows the average number of distance calculations to process the k-nearest neighbor queries. These measures are related to the query processing times (Table 7), as an increase in one or both of them directly increases the time. They are also affected by the number of dimensions of each dataset: the higher the dimensionality, the higher the time to compute each distance and the larger the disk space used. The curse of dimensionality [19] also contributes to the difficulty of searching or mining high-dimensional data, as the distances between pairs of elements tend to be very similar, decreasing the performance of the algorithms. For instance, one can notice the effect of the dimensionality on the time, the number of disk accesses, and the number of distance calculations for the SISAP Colors k-nearest neighbor queries. Analyzing the total number of nodes shown in Table 6, although SMR, SMD, and SMC resulted in indexes with more nodes than the standard M-tree, the number of nodes visited during k-nearest neighbor queries is smaller (Table 8). Moreover, the short-term memory strategies also visited fewer nodes when compared to forced reinsertions. Table 9 presents the average number of distance calculations; it is possible to notice a reduction of 47.7% for SMD when compared with the standard M-tree for the Covertype dataset. There is a linear correlation, for each dataset, between the processing times (Table 7) and both the disk accesses (Table 8) and the distance calculations (Table 9).


Table 10. Total time in seconds to run 100 executions of 100-nearest neighbors' queries (sm: short-term memory size; kFR: entries removed; rd: recursion depth).

               Std      sm = 250, kFR = 5, rd = 10      sm = 500, kFR = 10, rd = 20
Dataset        M-tree   SMR     SMD     SMC     M-FR    SMR     SMD     SMC     M-FR
Pendigits      0.067    0.056   0.054   0.054   0.061   0.051   0.048   0.050   0.061
Eigenfaces     0.053    0.046   0.050   0.048   0.049   0.044   0.048   0.045   0.051
Letter         0.179    0.164   0.168   0.165   0.177   0.151   0.153   0.152   0.184
SISAP Nasa     0.237    0.239   0.226   0.253   0.213   0.217   0.211   0.216   0.228
Corel          0.637    0.562   0.566   0.565   0.622   0.453   0.440   0.448   0.644
Covertype      1.753    1.541   1.419   1.668   1.249   1.095   0.998   1.015   1.278
SISAP Colors   2.370    2.495   2.376   2.457   2.434   2.286   2.312   2.240   2.414

               Std      sm = 750, kFR = 15, rd = 30     sm = 1000, kFR = 20, rd = 40
Dataset        M-tree   SMR     SMD     SMC     M-FR    SMR     SMD     SMC     M-FR
Pendigits      0.067    0.050   0.048   0.049   0.061   0.049   0.049   0.050   0.062
Eigenfaces     0.053    0.045   0.049   0.045   0.052   0.046   0.050   0.046   0.051
Letter         0.179    0.146   0.148   0.150   0.183   0.144   0.148   0.147   0.185
SISAP Nasa     0.237    0.207   0.205   0.213   0.234   0.201   0.202   0.198   0.248
Corel          0.637    0.424   0.405   0.409   0.632   0.398   0.377   0.398   0.607
Covertype      1.753    0.892   0.829   0.791   1.259   0.748   0.729   0.698   1.146
SISAP Colors   2.370    2.246   2.212   2.125   2.389   2.124   2.143   1.998   2.358

4.4. The effect of the parameters: short-term memory, kFR, and recursion depth

To answer Q4, ''How do the size of the short-term memory, the number of reinserted elements, and the size of the reinsertion stack affect the performance of the methods?'', we varied the following parameters in the experiments presented in Table 10: the size sm of the short-term memory for SMR, SMD, and SMC, from sm = 250 up to 1000; the number kFR of entries removed from a leaf node that is about to split, from kFR = 5 up to 20; and the recursion depth rd, from rd = 10 up to 40. As expected, for the short-term memory strategies, the larger the memory, the smaller the time to perform the queries. The reason is that the opportunity to analyze a more significant subset of instances when building the M-tree allows the creation of more efficient structures. Our experiments corroborate that the best values for the parameters kFR and recursion depth are 5 and 10, respectively, as defined in [11]; these values allow an acceptable tradeoff between construction and query processing time for forced reinsertions.

Table 10 shows that a short-term memory of 250 elements is small for the SISAP Colors experiment, resulting in a performance that is worse than the standard M-tree. For larger sizes of the short-term memory, the performance was up to 15% better (1.998 s for SMC with sm = 1000 against 2.370 s for the standard M-tree). Moreover, forced reinsertion performed better than the short-term memory for the Covertype dataset with parameters sm = 250, kFR = 5, rd = 10. In general, the experiments show that MIA allows creating trees with better performance, even when the naive selection of nodes (SMR) is employed, provided an appropriately sized short-term memory is used.

5. Conclusion

The development of efficient dynamic metric access methods is fundamental for similarity search. We developed a technique called MIA that employs a limited amount of memory to improve the performance of these methods. MIA is generic and may be used by several ball-partitioning metric access methods and by multidimensional methods. We presented a comprehensive evaluation of delayed insertion methods while comparing MIA to dynamic forced reinsertions on the M-tree. We empirically show, through an exhaustive set of experiments, that these methods decrease the growth in the volume of the nodes and the overlap among nodes.

Consequently, they speed up similarity queries over complex data. The experimental results show that the use of MIA outperforms forced reinsertions, building M-tree indexes with lower overlap among nodes and thus decreasing the time spent to run k-nearest neighbor queries.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] G. Navarro, N. Reyes, New dynamic metric indices for secondary memory, Inf. Syst. 59 (2016) 48–78, http://dx.doi.org/10.1016/j.is.2016.03.009.
[2] P. Ciaccia, M. Patella, P. Zezula, M-tree: An efficient access method for similarity search in metric spaces, in: Int'l Conf. Very Large Data Bases, VLDB, Athens, Greece, 1997, pp. 426–435.
[3] C. Traina, A. Traina, C. Faloutsos, B. Seeger, Fast indexing and visualization of metric data sets using slim-trees, IEEE Trans. Knowl. Data Eng. 14 (2) (2002) 244–260, http://dx.doi.org/10.1109/69.991715.
[4] T. Skopal, On fast non-metric similarity search by metric access methods, in: Int'l Conf. Ext. Database Tech., EDBT, LNCS 3896, Munich, 2006, pp. 718–736, http://dx.doi.org/10.1007/11687238_43.
[5] Y.N. Silva, W.G. Aref, P.-A. Larson, S. Pearson, M.H. Ali, Similarity queries: their conceptual evaluation, transformations, and processing, VLDB J. 22 (3) (2013) 395–420, http://dx.doi.org/10.1007/s00778-012-0296-4.
[6] J. Souza, H. Razente, M.C. Barioni, Optimizing metric access methods for querying and mining complex data types, J. Braz. Comput. Soc. 20 (17) (2014) 14, http://dx.doi.org/10.1186/s13173-014-0017-5.
[7] J. Lokoc, J. Mosko, P. Cech, T. Skopal, On indexing metric spaces using cut-regions, Inf. Syst. 43 (2014) 1–19, http://dx.doi.org/10.1016/j.is.2014.01.007.
[8] P. Oliveira, C. Traina, D. Kaster, Improving the pruning ability of dynamic metric access methods with local additional pivots and anticipation of information, in: East European Conf. Adv. Databases and Inf. Syst., ADBIS, LNCS 9282, Springer, Poitiers, France, 2015, pp. 18–31, http://dx.doi.org/10.1007/978-3-319-23135-8_2.
[9] H.L. Razente, R.L.B. Lima, M.C.N. Barioni, Similarity search through one-dimensional embeddings, in: ACM Symp. Applied Computing, SAC, Marrakech, Morocco, 2017, pp. 874–879, http://dx.doi.org/10.1145/3019612.3019674.
[10] J. Gama, A survey on learning from data streams: current and future trends, J. Prog. Artif. Intell. 1 (1) (2012) 45–55, http://dx.doi.org/10.1007/s13748-011-0002-6.
[11] J. Lokoc, T. Skopal, On reinsertions in M-tree, in: Int'l Workshop on Similarity Search and Applications, SISAP, IEEE, 2008, pp. 121–128, http://dx.doi.org/10.1109/SISAP.2008.10.
[12] J. Lokoc, Parallel dynamic batch loading in the M-tree, in: Int'l Workshop on Similarity Search and Applications, SISAP, IEEE, Prague, Czech Republic, 2009, pp. 117–123, http://dx.doi.org/10.1109/SISAP.2009.27.


[13] P. Ciaccia, M. Patella, Bulk loading the M-tree, in: 9th Australasian Database Conference, ADC'98, Perth, Australia, 1998, pp. 15–26.
[14] T. Vespa, C. Traina, A. Traina, Efficient bulk-loading on dynamic metric access methods, Inf. Syst. 35 (5) (2010) 557–569, http://dx.doi.org/10.1016/j.is.2009.07.002.
[15] C. Traina, A. Traina, R.S. Filho, C. Faloutsos, How to improve the pruning ability of dynamic metric access methods, in: Int'l Conf. Information and Knowledge Manag., CIKM, McLean, 2002, pp. 219–226, http://dx.doi.org/10.1145/584792.584831.
[16] A. Guttman, R-trees: A dynamic index structure for spatial searching, in: Int'l Conf. on Management of Data, SIGMOD, Boston, MA, 1984, pp. 47–57, http://dx.doi.org/10.1145/602259.602266.
[17] H.L. Razente, R.M. dos Santos Sousa, M.C.N. Barioni, Metric indexing assisted by short-term memories, in: Int'l Conf. on Similarity Search and Applications, SISAP, LNCS 11223, Lima, Peru, 2018, pp. 107–121, http://dx.doi.org/10.1007/978-3-030-02224-2_9.
[18] T. Skopal, J. Lokoc, New dynamic construction techniques for M-tree, J. Discrete Algorithms 7 (1) (2009) 62–77, http://dx.doi.org/10.1016/j.jda.2008.09.013.
[19] H. Samet, Foundations of Multidimensional and Metric Data Structures, Morgan Kaufmann, San Francisco, 2006.
[20] M.R. Vieira, C. Traina, F.J.T. Chino, A. Traina, DBM-tree: A dynamic metric access method sensitive to local density data, J. Inf. Data Manag. 1 (1) (2010) 111–127.


[21] F. Korn, B. Pagel, C. Faloutsos, On the 'dimensionality curse' and the 'self-similarity blessing', IEEE Trans. Knowl. Data Eng. 13 (1) (2001) 96–111, http://dx.doi.org/10.1109/69.908983.
[22] T. Skopal, J. Pokorný, M. Krátký, V. Snásel, Revisiting M-tree building principles, in: East European Conf. Adv. Databases and Inf. Syst., ADBIS, LNCS 2798, Dresden, Germany, 2003, pp. 148–162, http://dx.doi.org/10.1007/978-3-540-39403-7_13.
[23] R.T. Ng, J. Han, CLARANS: A method for clustering objects for spatial data mining, IEEE Trans. Knowl. Data Eng. 14 (5) (2002) 1003–1016, http://dx.doi.org/10.1109/TKDE.2002.1033770.
[24] Arboretum, The database group at ICMC/USP Arboretum library, 2019, https://bitbucket.org/gbdi/arboretum (accessed Aug. 2019).
[25] D. Dua, C. Graff, UCI machine learning repository, Univ. of California, Irvine, School of Inf. and Comp. Sciences, 2017, http://archive.ics.uci.edu/ml.
[26] K. Figueroa, G. Navarro, E. Chávez, Metric spaces library, 2007, available at http://www.sisap.org/Metric_Space_Library.html.
[27] N. Roussopoulos, S. Kelley, F. Vincent, Nearest neighbor queries, in: Int'l Conf. Manag. of Data, SIGMOD, San Jose, 1995, pp. 71–79, http://dx.doi.org/10.1145/223784.223794.
