Large-scale parallel similarity search with Product Quantization for online multimedia services


Accepted manuscript, to appear in: J. Parallel Distrib. Comput. Received 4 April 2018; revised 31 July 2018; accepted 20 November 2018. doi:10.1016/j.jpdc.2018.11.009

Highlights

Special Issue Paper Submission: Trends on Heterogeneous and Innovative Hardware and Software Systems, Journal of Parallel and Distributed Computing (JPDC).

1) Similarity search in high-dimensional spaces is a core operation found in several online multimedia retrieval applications.

2) We propose an efficient parallelization of the Product Quantization based approximate nearest neighbor similarity search indexing (PQANNS).

3) We develop mechanisms (ADAPT and ADAPT+G) that change the parallelism and task granularity configurations during the execution to minimize query response times in scenarios in which the query rates vary.

4) The proposed mechanisms reduced query response times by up to 6.4× as compared to the best static configuration of parallelism and task granularity.

5) The distributed memory execution of PQANNS using 128 nodes/3584 CPU cores attained a parallel efficiency of 0.97 and was able to index and search a dataset with 256 billion SIFT vectors.



Large-Scale Parallel Similarity Search with Product Quantization for Online Multimedia Services

Guilherme Andrade (a), André Fernandes (b), Jeremias M. Gomes (b), Renato Ferreira (a), George Teodoro (b)

(a) Department of Computer Science, Universidade Federal de Minas Gerais, Brazil
(b) Department of Computer Science, University of Brasília, Brazil

Email addresses: [email protected] (Guilherme Andrade), [email protected] (André Fernandes), [email protected] (Jeremias M. Gomes), [email protected] (Renato Ferreira), [email protected] (George Teodoro)

Abstract

Similarity search in high-dimensional spaces is a core operation found in several online multimedia retrieval applications. With the popularity of these applications, they are required to handle very large and growing datasets while keeping response times low. This problem is aggravated in the context of online applications, in which the load on the system varies during the execution according to users' demands; those variations require the application to adapt at run-time in order to minimize response times. In this paper, we address these challenges with an efficient parallelization of the Product Quantization Approximate Nearest Neighbor Search (PQANNS) indexing. This method can answer queries with a reduced memory demand and, coupled with the distributed memory parallelization proposed here, can efficiently handle very large datasets. We have also proposed mechanisms to minimize query response times in online scenarios in which the query rates vary at run-time; to this end, our strategies tune the parallelism configuration and task granularity during the execution. The parallelism and granularity tuning approaches (ADAPT and ADAPT+G) have been shown, for instance, to reduce query response times by a factor of up to 6.4× in comparison with the best static configuration of parallelism and task granularity. Further, the distributed memory execution using 128 nodes/3584 CPU cores attained a parallel efficiency of 0.97 with a dataset of 256 billion SIFT vectors.

Keywords: Multimedia Similarity Search, Descriptor Indexing, Dynamic Parallelism, Product Quantization

1. Introduction

A similarity search refers to finding the object(s) most similar (i.e., the nearest neighbors) to a query in a dataset.

This is an important operation in multimedia retrieval applications [1, 2, 3], which typically represent multimedia objects using high-dimensional feature vectors. The complete processing in content-based multimedia retrieval (CBMR) applications, such as image search engines, may involve multiple complex steps, but similarity search is typically one of the most compute demanding. Due to the high dimensionality of the multimedia descriptors and the large number of descriptors found in datasets employed in this domain, the use of exact brute-force algorithms may be prohibitive. Thus, several indexing algorithms and data structures have been introduced with the goal of reducing the search space, for instance, by partitioning the data objects (feature vectors or descriptors) spatially. This partitioning is then used to prune the search space by avoiding partitions of the dataset that cannot contain nearest neighbors. Some of these indexing approaches use kd-trees [4], k-means trees [5], cover trees [6], etc. Still, these attempts perform poorly as the space dimensionality increases because of the well-known "curse of dimensionality" [1, 7].

For applications in which the exact answer can be traded off for speed, the approximate nearest neighbors (ANN) search can be used to improve performance. Some of the most successful ANN indexing solutions include FLANN (Fast Library for Approximate Nearest Neighbors) [8], LSH (Locality-Sensitive Hashing) [9], and the Product Quantization ANN Search [10] (PQANNS). PQANNS has been shown to exceed the performance of competitors in terms of execution time and memory demands [10]. Most of these works have focused on achieving maximum performance in a sequential setting in which a batch of queries is computed. However, the demands of modern online applications include (i) indexing very large datasets that would not fit into the memory of a single machine, and (ii) minimizing response times of individual queries (vs. a batch of queries) under online service scenarios in which the query arrival rates fluctuate at run-time.

In order to address the aforementioned challenges, in this work we have proposed and implemented a distributed memory parallel version of the PQANNS indexing. This parallelization was carried out by decomposing PQANNS into a set of dataflow stages that communicate asynchronously. We have also performed the internal parallelization of the application stages, allowing a single stage to fully utilize each computing node of the distributed environment. This minimizes the number of input dataset partitions and, as a consequence, reduces the network communication demands during the search. The intra-stage parallelization is flexible and allows for a dynamic binding of computing resources to execute queries arriving in the system. In this strategy, a <outer, inner> parallelism tuple is adjusted with the goal of minimizing query response times. The outer parallelism denotes the number of queries being executed in parallel inside each copy of the application stages. The inner parallelism, on the other hand, refers to the number of computing threads used to execute each query. For instance, in a machine with 8 computing cores/threads, the parallelism tuple <1,8> would lead to a single query being processed in parallel by 8 threads, whereas the tuple <8,1> would consist of 8 queries being concurrently executed by different threads.

Note that this parallelism configuration is internal to a process running on a single node. We define a query as each feature vector that is used to search for ANN in the dataset. In a complete multimedia application, multiple feature vector queries may be required to answer, for instance, a single image search query, depending on the methods used. However, the feature vector search remains one of the most expensive phases of those applications.

The tuning of the parallelism is computed at run-time in response to load variations on the system. In cases with low loads (query rates), the availability of computing resources is high. Thus, multiple CPU cores should be used to accelerate the execution of each incoming query. As the application load increases, however, several queries are ready for computation, and increasing the outer (and decreasing the inner) parallelism tends to be the best option. In the latter case, the system throughput increases due to the reduction of inter-thread synchronization, as few threads (or a single thread) are used to execute each query. In both cases, the query response times are reduced by adequate parallelism tuning.

We have also extended the tuning to the granularity of the tasks processed by an application stage. In our case, the granularity of a computation task may be changed by grouping multiple queries into a single task. In essence, this grouping creates coarse-grain tasks, which amortizes the cost of dispatching a task for execution and, as a consequence, improves the system throughput. As the granularity increases, however, the query execution times also grow. As such, it must be used carefully to avoid negatively affecting the user-observed response times. Given this additional aspect for tuning, our system adjusts a <outer, inner, granularity> tuple according to the observed load with the goal of minimizing the response times. This is computed at run-time in response to system load variations by the ADAPT+G tuning algorithm proposed in this work.

This paper extends our preliminary work [11] with a novel distributed memory parallelization of the PQANNS using the Message Passing Interface (MPI), which has been evaluated, to the best of our knowledge, using the largest dataset employed in the literature. We have also compared PQANNS to state-of-the-art indexing methods, and extended the discussion of our auto-tuning strategies, background, and related work. The contributions of our work can be summarized as follows:

• We proposed a novel distributed memory PQANNS with a dataflow decomposition, which is coupled with a task-based run-time system to efficiently exploit machines equipped with multicore CPUs.

• We developed strategies to dynamically tune the application parallelism and task granularity with the goal of minimizing the average query response times. These approaches have been able to reduce the response times as compared with static settings by up to 6.4×.

• The evaluation of the distributed PQANNS indexing resulted in a parallel efficiency of about 0.97 (97%) on a distributed memory machine with 128 nodes and 3584 CPU computing cores. This experiment used a dataset with 256 billion SIFT descriptors, which is 5.9× larger than the largest dataset employed by related works [12].

The remainder of the manuscript is organized as follows: Section 2 presents the related work on multimedia similarity search and its parallel approaches, whereas Section 3 details the Product Quantization based similarity search indexing. The approach to parallelize PQANNS on distributed memory machines is presented in Section 4, and the parallelism and task granularity tuning strategies are discussed in Section 5. The propositions are evaluated in Section 6, and we conclude and present directions for future work in Section 7.

2. Related Work

This section describes the main indexing algorithms used for ANN search, and presents the closely related works on parallel and distributed ANN.

2.1. Nearest Neighbors Search

The nearest neighbors (NN) search problem has received increasing attention in the last decades. Several efforts have focused on the development of data structures, including the kd-tree [4], k-means tree [5], cover tree [6], and others, that provide a locality-aware partition of the input data, allowing the search space to be pruned during a NN search. However, this ability to quickly find relevant partitions in the space degrades as the data dimensionality increases. This phenomenon is the well-known "curse of dimensionality" [1, 7].

The approximate nearest neighbors (ANN) search has been proposed to improve the scalability and speed of NN search in high-dimensional spaces for applications in which exact answers are not essential. Several competitive ANN techniques and algorithms have been proposed [9, 8, 10, 13, 14]. The Product Quantization based ANN search [10] (PQANNS) is a successful approach that decomposes the space into a Cartesian product of subspaces and quantizes each of them. The vector created from the quantized subspaces is then used to estimate Euclidean distances. The locality sensitive hashing (LSH) [9] based indexing is another competitive technique for similarity search in high-dimensional spaces. It employs locality sensitive hashing functions to index and search the high-dimensional space. Multicurves [13] performs multiple projections of subsets of the data objects' subspaces onto a 1-dimensional space using space-filling curves. Its search phase executes the same projections on the query object and uses them to retrieve the nearest points in the 1-dimensional sorted lists. FLANN [8, 15] builds a framework that automatically selects the best strategy among algorithms such as randomized kd-trees [16], hierarchical k-means [17], and LSH [9] for a given dataset, showing that a single strategy or indexing is not always the best fit for all types of datasets. The comparison against FLANN is used in many works to demonstrate the effectiveness of an indexing algorithm.

2.2. Parallel and Distributed ANN Search

The ANN requirements include searching in large datasets, achieving high throughput, and providing low response times to the end-users. These demands have motivated the development of ANN indexing methods that make use of high performance techniques and scalable distributed memory machines [18, 19, 20, 21, 22, 3, 23, 24, 15].

The MapReduce based parallelizations of LSH [18, 19] are interesting related works. In the work of Stupar et al. [18], the MapReduce formulation of LSH has: (i) a map phase that independently visits the buckets to which a query object is hashed in order to generate a per-bucket nearest neighbor candidate set, and (ii) a reduce phase that aggregates results from all visited buckets. This LSH implementation stores the buckets of points in a distributed file system (HDFS) using a single file per bucket. As reported by the authors, combinations of LSH parameters may create a very large number of files (buckets) and reduce the overall system performance. In addition, this solution stores the data objects' content in the bucket (files) of each hash table used, instead of the object identifier (pointer) as in the original algorithm. As a consequence, the entire dataset is replicated for each of the hash tables used by LSH. An efficient configuration of LSH may require hundreds of hash tables, and this level of data replication is prohibitive for large datasets. Also, the high latency of data accesses makes it inefficient for online applications due to the high query processing times. Bahmani et al. [19] implemented another MapReduce-based variant of the LSH algorithm, referred to as Layered LSH. They implemented two versions of LSH using: (1) Hadoop for file system based data storage and (2) an Active Distributed Hash Table (DHT) for in-memory data storage. They proposed theoretical bounds for network traffic assuming that a single LSH hash table is used. If multiple hash tables are used, the same data object is indexed in buckets from different hash tables and, as a consequence, the data partitioning becomes more complex. As such, neither of the MapReduce based parallelizations of LSH [18, 19] solves the challenging problem of building a large-scale indexing that minimizes communication and avoids data replication, while preserving the behavior of the sequential algorithm and providing low query response times. As presented in Section 4, the parallelization strategy we propose in this paper addresses these limitations.

A parallel version of FLANN on top of the Message Passing Interface (MPI) [25] was proposed in [15]. Although interesting, this implementation has high memory bandwidth demands due to the algorithms implemented in FLANN, which leads to saturation of the memory bandwidth and limits the scalability. GPU based implementations of PQANNS have also been proposed [26, 27], but they are limited to a single machine execution and, as such, cannot handle very large datasets. Finally, the works [28, 12] present a novel indexing strategy based on trees, which was able to perform searches in datasets with up to 43 billion descriptors. These works implement an interesting batch searching approach that bundles multiple query feature vectors into a single query batch. This optimization is intended to reduce the system execution time overheads and improve in-memory data reuse to increase the throughput.

In our approach, we also group queries for execution by increasing the task granularity. However, we perform that grouping dynamically as the observed input query loads vary, with the goal of minimizing individual query execution times. Also, Moise et al. [28] implement a full image search engine in which the input is an image and the output is a set of similar images, whereas the other compared works (including ours) focus on the core operation of performing feature vector searches. In order to simplify the comparison of the available parallel approaches, we have summarized the characteristics of the distributed memory based ANN search solutions in Table 1.

Table 1: Overview of the distributed memory ANN approaches discussed in this section. We have used the largest dataset for which the authors reported the execution times.

Work       Algorithm employed   Dataset size   # of nodes   Response time-aware   Task granularity tuning
[18]       LSH                  100K           1            No                    No
[19]       LSH                  1M             16           No                    No
[15]       FLANN                80M            4            No                    No
[22]       Multicurves          130M           8            Yes                   No
[28, 12]   Index Tree           43B            100          No                    No
Ours       PQANNS               256B           128          Yes                   Yes

2.3. Adaptive Parallelism

Parallelism tuning has been attracting increasing interest from the high-performance computing community. Several run-time systems and interfaces have been proposed to exploit the performance benefits of this technique for different applications and hardware architectures with specific performance goals [29, 30, 31, 32, 22]. In [31], the authors have shown how the availability of CPU processing can vary, for different reasons, during the execution of an application in the domain of CMPs (chip multiprocessors). They studied how to adapt the parallelism targeting the energy-delay product (EDP) metric. To do so, they proposed the use of helper threads that run in parallel with the application execution to decide the ideal number of CPU threads and voltage/frequency levels. Similarly, the work [29] proposed scheduling strategies for dynamically adapting the loop nesting level and degree of parallelism on a Cell Broadband Engine system with user-level schedulers. Further, in [32], machine learning techniques have been proposed to choose the best parallelism configuration. These works adapt the parallelism targeting energy efficiency. The DoPE [33] system provides an API that enables the developer to express parallelism in loop nests, and allows administrators to specify different performance goals. Finally, considering the application domain under study in this work, Hypercurves [3] proposed dynamic strategies for adapting parallelism and the allocation of threads to computing cores, achieving a reduction of up to 74% in query response time when compared to static approaches.

The aforementioned run-time systems, APIs, and algorithms adapt the parallelism levels targeting performance. In this work, we present an adaptive task-based run-time system applied to the PQANNS search algorithm, introducing new adaptive approaches that reconfigure at run-time to minimize query response times under the fluctuating loads observed in online applications. We also introduce the concept of task granularity as an additional component to be tuned for minimizing response times by improving the system throughput.

3. Product Quantization based Approximate Nearest Neighbor Search (PQANNS)

This section presents the approach for ANN search based on Product Quantization [10] that is parallelized in this work. The concepts of quantization are described first, and the PQANNS algorithm is detailed afterwards.

3.1. Quantization and Subspaces

The process of quantizing a vector x refers to mapping x to a lower dimensional representation. As the quantization is calculated, information is lost, since the quantized counterpart of x cannot provide the same details as the original vector. In order to minimize the amount of information lost, PQANNS computes the quantization as a mapping of the D-dimensional vector x to another vector q(x) ∈ C = {c_i; i ∈ τ}, where the index set τ is finite: τ = {0, ..., k−1}. The c_i values are centroids calculated in a pre-processing phase using a k-means algorithm on a sample of the dataset. In the mapping, the original vector is represented by its nearest centroid c_i ∈ C:

    q(x) = argmin_{c_i ∈ C} d(x, c_i)                                        (1)

As shown in Figure 1, quantizing a space can be seen as the calculation of a Voronoi diagram with one centroid per Voronoi cell. Quantizing the entire D-dimensional space of x together into a single set C would require k (the number of centroids) to be very large. Therefore, the quantization approach used in PQANNS divides x into m subspaces that are quantized independently (Equation 1), each distinct subvector u_j, 1 ≤ j ≤ m, having D* = D/m dimensions. The quantization of vector x can be represented as:

    x_1, ..., x_{D*}, ..., x_{D−D*+1}, ..., x_D → q_1(u_1(x)), ..., q_m(u_m(x)),
    where u_1(x) = (x_1, ..., x_{D*}) and u_m(x) = (x_{D−D*+1}, ..., x_D)     (2)

As a result, the quantization of x is the Cartesian product of its quantized subvectors:

    q(x) = q_1(u_1(x)) × q_2(u_2(x)) × ... × q_m(u_m(x))                      (3)
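To make the decomposition concrete, the sketch below shows how a vector could be encoded by quantizing each of its m subvectors independently against per-subspace k-means codebooks, as in Equations 1-3. This is our illustration, not the authors' code: pq_encode and the codebook layout are assumptions, and the uint8_t codes assume k ≤ 256 centroids per subspace.

#include <vector>
#include <cstdint>
#include <limits>

// codebooks[j][i] is the i-th centroid (D* = D/m floats) of subspace j.
std::vector<uint8_t> pq_encode(
        const std::vector<float>& x,
        const std::vector<std::vector<std::vector<float>>>& codebooks) {
    const size_t m = codebooks.size();          // number of subspaces
    const size_t dsub = x.size() / m;           // D* dimensions per subvector
    std::vector<uint8_t> code(m);
    for (size_t j = 0; j < m; ++j) {            // quantize u_j(x) independently (Eq. 1)
        float best = std::numeric_limits<float>::max();
        for (size_t i = 0; i < codebooks[j].size(); ++i) {
            float dist = 0.0f;                  // squared L2 distance to centroid i
            for (size_t d = 0; d < dsub; ++d) {
                float diff = x[j * dsub + d] - codebooks[j][i][d];
                dist += diff * diff;
            }
            if (dist < best) { best = dist; code[j] = static_cast<uint8_t>(i); }
        }
    }
    return code;    // m centroid indices: q_1(u_1(x)), ..., q_m(u_m(x))
}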

This approach allows low-complexity quantizers for each j-th subvector to be combined to create a higher-complexity indexing, which addresses a large codebook (τ = τ_1 × ... × τ_m). The codebook is constructed from the Cartesian product of multiple small centroid sets C = C_1 × ... × C_m. Assuming that each C_j contains k centroids, a total of k^m possible quantization combinations are available. This allows for creating large codebooks and for the reduction of the data dimensionality.

3.2. Distance Calculation in Quantized Spaces

After the quantization is computed, vectors in the dataset are represented by their indexes in the codebook. Therefore, the nearest neighbors search is computed in the quantized space using the codebook index. The authors of [10] proposed two strategies to approximate the distance between a query vector x and the quantized values of vectors stored in the dataset (q(y)), as follows.

Symmetric Distance Computation (SDC). In this method, the quantizers of both vectors x and y (q(x), q(y)) are used. The approximation of the distance d(x, y) is calculated as:

    d̂(x, y) = d(q(x), q(y)) = sqrt( Σ_{j=1}^{m} d(q_j(x), q_j(y))² )         (4)

Asymmetric Distance Computation (ADC). The asymmetric distance is computed using the quantizers of the indexed (dataset) vectors and the actual query vector. As such, ADC tries to improve the quality of the approximation by using x instead of q(x). The distance is computed as:

    d̃(x, y) = d(x, q(y)) = sqrt( Σ_{j=1}^{m} d(u_j(x), q_j(u_j(y)))² )       (5)
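In practice, ADC is commonly implemented with per-subspace lookup tables: the distances from each query subvector to all centroids are computed once, so that Equation 5 reduces to m table lookups per database vector. The sketch below is ours (Table, build_adc_table, and adc_distance are illustrative names), reusing the codebook layout assumed in the previous sketch.

#include <vector>
#include <cstdint>

using Table = std::vector<std::vector<float>>;  // table[j][i] = d(u_j(x), c_{j,i})^2

Table build_adc_table(const std::vector<float>& x,
                      const std::vector<std::vector<std::vector<float>>>& codebooks) {
    const size_t m = codebooks.size(), dsub = x.size() / m;
    Table table(m);
    for (size_t j = 0; j < m; ++j) {
        table[j].resize(codebooks[j].size());
        for (size_t i = 0; i < codebooks[j].size(); ++i) {
            float dist = 0.0f;                  // squared L2 distance, computed once per centroid
            for (size_t d = 0; d < dsub; ++d) {
                float diff = x[j * dsub + d] - codebooks[j][i][d];
                dist += diff * diff;
            }
            table[j][i] = dist;
        }
    }
    return table;
}

// Squared ADC distance between the query and one quantized database vector.
float adc_distance(const std::vector<uint8_t>& code, const Table& table) {
    float dist = 0.0f;
    for (size_t j = 0; j < code.size(); ++j) dist += table[j][code[j]];
    return dist;    // monotonic in Eq. 5; the square root is unnecessary for ranking
}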

The visual representations of the distance approximations using SDC and ADC are shown in Figure 1. As presented, the main difference between them is that ADC uses the actual x value instead of its quantization (q(x)), in an attempt to improve the accuracy of the distances.

3.3. Searching in Quantized Spaces

The approximate nearest neighbor search in quantized spaces is able to reduce both the computation and memory demands [10]. However, even in quantized spaces, methods that exhaustively compare a query vector to every element in the dataset would be expensive. As such, in order to reduce the number of comparisons during a search, the use of an inverted file system (inverted list) along with ADC (IVFADC) has been proposed. In this scheme, each entry of the inverted list groups descriptors (vectors) that are similar or close in the quantized space. The entries in the inverted list are represented by "coarse quantizer" centroids, which are also learned using a k-means clustering algorithm on the training dataset.


Figure 1: Representation of the Symmetric (left) and Asymmetric (right) distance computations. The Symmetric case uses only quantized values to compute distances d(q(x), q(y)), whereas the Asymmetric employs the actual x vector and the quantized vectors of the dataset d(x, q(y)).

The indexing structure with the inverted file is presented in Figure 2. As discussed, each entry of the inverted list is associated with a coarse quantizer centroid, and each entry contains a list of vectors represented by an ID and a Code. The ID could be, for instance, the identifier of the image from which the vector was computed, and the Code encodes the distance between the vector and the corresponding coarse centroid; the Code is used during the search to improve the search results. The same figure illustrates the indexing and searching steps of the algorithm.

The Indexing is shown in the top part of Figure 2. Given each vector y from the dataset, the algorithm computes the following steps (a sketch of these steps in code appears after the list):

1. quantize y: q_c(y);
2. compute the residual value r(y) from the quantized vector q_c(y) and the actual descriptor: r(y) = y − q_c(y);
3. quantize r(y) to q_p(r(y)), which, for the product quantizer, amounts to assigning u_j(y) to q_j(u_j(y)), for j = 1 ... m;
4. insert the new item into the inverted list entry corresponding to q_c(y), with the vector identifier and the quantized residual value.
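A minimal sketch of these four indexing steps, under the same assumptions as the earlier pq_encode sketch (the Entry/InvertedFile layout and the helper names are ours, not the authors' implementation):

#include <vector>
#include <cstdint>
#include <limits>

struct Entry { long id; std::vector<uint8_t> code; };   // the <ID, Code> pair stored per entry
using InvertedFile = std::vector<std::vector<Entry>>;   // one list per coarse centroid

// Index of the coarse centroid nearest to v (step 1); the centroids are the
// "coarse quantizer" learned by k-means on the training dataset.
int coarse_quantize(const std::vector<float>& v,
                    const std::vector<std::vector<float>>& centroids) {
    int best = 0;
    float bestDist = std::numeric_limits<float>::max();
    for (size_t i = 0; i < centroids.size(); ++i) {
        float dist = 0.0f;
        for (size_t d = 0; d < v.size(); ++d) {
            float t = v[d] - centroids[i][d];
            dist += t * t;
        }
        if (dist < bestDist) { bestDist = dist; best = static_cast<int>(i); }
    }
    return best;
}

void ivfadc_add(long y_id, const std::vector<float>& y, InvertedFile& lists,
                const std::vector<std::vector<float>>& coarseCentroids,
                const std::vector<std::vector<std::vector<float>>>& codebooks) {
    int c = coarse_quantize(y, coarseCentroids);         // step 1: q_c(y)
    std::vector<float> r(y.size());                      // step 2: r(y) = y - q_c(y)
    for (size_t d = 0; d < y.size(); ++d) r[d] = y[d] - coarseCentroids[c][d];
    lists[c].push_back({y_id, pq_encode(r, codebooks)}); // steps 3-4: encode residual, insert
}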

The Searching for the nearest neighbors of x is performed in the following steps (a sketch combining them appears after the list):

1. quantize the x vector to the w nearest centroids in the coarse quantizer codebook (q_c). The algorithm uses w elements to allow the search to take place in multiple inverted list entries, which may be necessary when a single entry is not sufficient to attain the desired quality. The next steps are repeated for each of the w inverted list entries;
2. compute the distances between each subvector of the residual r(x) and the centroids of the associated subquantizer j;
3. calculate the distances between r(x) and the elements in that inverted list entry;
4. retrieve the k nearest neighbors of x based on the distances calculated in the previous step.
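Combining the lists above with the earlier ADC sketch, a hedged end-to-end search loop could look as follows; ivfadc_search and nearest_coarse are our illustrative names, and the code reuses Entry, InvertedFile, Table, build_adc_table, and adc_distance from the previous sketches.

#include <vector>
#include <queue>
#include <algorithm>
#include <utility>

// The w nearest coarse centroids to x (step 1), by squared L2 distance.
std::vector<int> nearest_coarse(const std::vector<float>& x,
                                const std::vector<std::vector<float>>& centroids, int w) {
    std::vector<std::pair<float, int>> dist;
    for (size_t i = 0; i < centroids.size(); ++i) {
        float s = 0.0f;
        for (size_t d = 0; d < x.size(); ++d) { float t = x[d] - centroids[i][d]; s += t * t; }
        dist.push_back({s, static_cast<int>(i)});
    }
    size_t n = std::min<size_t>(w, dist.size());
    std::partial_sort(dist.begin(), dist.begin() + n, dist.end());
    std::vector<int> ids;
    for (size_t i = 0; i < n; ++i) ids.push_back(dist[i].second);
    return ids;
}

std::vector<std::pair<float, long>> ivfadc_search(
        const std::vector<float>& x, const InvertedFile& lists,
        const std::vector<std::vector<float>>& coarseCentroids,
        const std::vector<std::vector<std::vector<float>>>& codebooks, int w, int k) {
    std::priority_queue<std::pair<float, long>> heap;    // max-heap over the k best distances
    for (int c : nearest_coarse(x, coarseCentroids, w)) {
        std::vector<float> r(x.size());                  // residual r(x) for this list entry
        for (size_t d = 0; d < x.size(); ++d) r[d] = x[d] - coarseCentroids[c][d];
        Table table = build_adc_table(r, codebooks);     // step 2: per-subquantizer tables
        for (const Entry& e : lists[c]) {                // step 3: ADC distance per element
            float dist = adc_distance(e.code, table);
            if (heap.size() < static_cast<size_t>(k)) heap.push({dist, e.id});
            else if (dist < heap.top().first) { heap.pop(); heap.push({dist, e.id}); }
        }
    }
    std::vector<std::pair<float, long>> knn;             // step 4: k nearest, ascending order
    while (!heap.empty()) { knn.push_back(heap.top()); heap.pop(); }
    std::reverse(knn.begin(), knn.end());
    return knn;
}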


Figure 2: Overview of the inverted file with asymmetric distance computation (IVFADC).


4. The Distributed Memory Parallelization of PQANNS

This section describes the parallelization of the PQANNS algorithm (the efficient IVFADC version described in the previous section) targeting distributed memory machines equipped with multi-core CPUs. We first present an overview of the parallelization in Section 4.1, and our strategy to efficiently utilize multicore CPUs is presented in Section 4.2.

4.1. Distributed Memory Parallelization

The strategy for the parallelization of the PQANNS (IVFADC) algorithm consists of partitioning the dataset among the nodes of a distributed memory machine, performing the search locally in each of these nodes, and computing a reduction to aggregate the node-local results. The most costly phase of the algorithm is the search, which, as a consequence, is the focus of multiple optimizations. Our parallelization has been developed on top of the Message Passing Interface (MPI) [25], but the application is decomposed into stages in a dataflow style [34, 35, 36], and those stages communicate through "directed streams".

Figure 3: PQANNS decomposition into the dataflow programming paradigm. The index building phase partitions the input dataset among the Index Search copies without data replication (message i). During the search phase, the Index Search stage copies receive information about the query and locally compute the NN in their partition of the input dataset. Further, the Aggregator receives those local results and computes the global NN results.

The parallelization strategy decomposes the PQANNS into four computing stages, each of which may be replicated in the distributed memory machine with the number of required copies (Figure 3). These stages are structured into two pipelines that perform the index construction and search phases of the application. The index construction involves the Input Reader and Index Search stages. In this phase, the Input Reader copies read the files with the input dataset and, for each dataset descriptor or vector y to be indexed, this stage quantizes it and sends the vector ID (y_id), the coarse grain centroid (q_c(y)), and the quantized Code (q(r(y))) to be stored in the Index Search stage. The descriptors are sent to the Index Search copies in a round-robin fashion, so that the dataset is evenly distributed among the machines storing the index without replication.

This is important to avoid load imbalance during the search phase.

The search phase of the PQANNS employs three stages: Input Query, Index Search, and Aggregator. The Input Query reads the flow of arriving queries (vectors), for instance, received from a web search engine interface. For each query vector, it quantizes the vector to the nearest w centroids, and the quantization information is forwarded to the Index Search using a broadcast (Algorithm 1). It is important to highlight that this broadcast has a small impact on the scalability of the workflow, because the application is dominated by the search in the index, and communication is carried out by communication threads in the background, overlapped with computation. After the message is received by the Index Search, it retrieves the k nearest neighbors to the query in its partition of the dataset and sends that local response to the Aggregator (Algorithm 2). The Aggregator receives the information from each Index Search and computes the global k-nearest neighbors (k-NN) response (Algorithm 3). It first merges the incoming k-NN with the results already received from other nodes (a reduce operation) and, once the local results from all Z Index Search instances have arrived, the global k-NN results are output.

Algorithm 1 Input Query
  query ← ∅;
  while inputQuery ≠ ∅ do
      query ← read(inputQuery);
      quantized_query ← quantize(query, w);
      MPI_Bcast(quantized_query, ...);
  end

Algorithm 2 Index Search
  while true do
      query ← MPI_Recv(...);
      local_knn ← PQANNS(query);
      Aggregator_dest ← x_id % D;
      MPI_Send(local_knn, Aggregator_dest);
  end

Algorithm 3 Aggregator
  while true do
      knn ← MPI_Recv(...);
      reduce(knn, x_id);
      count(x_id)++;
      if count(x_id) == Z then
          outputGlobalKNN(x_id);
      end
  end

As one may have noticed, the messages sent from the Index Search to the Aggregator employ a communication policy called labeled-stream [37]. This policy associates a label or tag with each message (x_id), which is used to route all messages with the same tag value to a single copy of the receiver stage. The mapping of a tag to a receiver is calculated using a hash function that takes the tag as an input parameter and returns a value that corresponds to the identifier of the receiver stage copy (a value between 1...D in our case; see Figure 3).
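A minimal sketch of this tag-to-copy mapping (the helper name is ours; beyond taking the tag as input, the actual hash used by the run-time is not specified, and Algorithm 2 shows a plain modulo):

#include <functional>

// Route all partial results labeled with the same query id x_id to the same
// Aggregator copy, out of D copies.
int aggregator_for(long x_id, int D) {
    return static_cast<int>(std::hash<long>{}(x_id) % D);
}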

We use this communication scheme to compute a parallel reduction of the query partial (node-local) results calculated by the Index Search copies, as multiple Aggregators may be executed in the environment. Our parallelization also allows the index building and searching phases to be performed concurrently.

4.2. Intra-Stage Parallelization

The intra-stage parallelization refers to a stage's ability to execute tasks in parallel using the multiple computing cores available in a node. This level of parallelism is essential to fully take advantage of current machines, which are typically built as multi-socket, multi-core systems. The intra-stage parallelism in our application is implemented by a run-time system that we developed to run on top of the application stages. In particular, our focus for intra-stage parallelism has been on the Index Search stage due to the fact that: (i) it is the most compute expensive phase of the application; and (ii) it stores most of the application state. As such, a smaller number of copies of this stage leads to a lower number of data partitions, one per machine instead of one per CPU computing core. As a consequence, it reduces communication demands during the search, since each partition of the index needs to be consulted in this phase.

The stage run-time developed in this work expresses the stage computation through a workflow of fine-grain tasks, which may be executed sequentially or in parallel. This flexibility in the task description allows the run-time to: (i) dynamically adapt the level of parallelism or number of computing threads assigned to execute a task; and (ii) modify the granularity of a task at run-time by grouping multiple queries into a single, more compute demanding coarse-grain task. These two aspects are tuned by our run-time as the load observed by the application varies, as detailed in Section 5, with the goal of minimizing the response time observed by the user.

Our execution model represents a sequence of functions with their parameters as a task, which is registered with the run-time by the application developer along with its dependencies. Tasks can be created and submitted to the run-time system during the execution. In this way, the data elements or requests arriving at the application stages are mapped into tasks and executed in the background, overlapped with communication. The run-time has threads that compute the tasks, and another thread responsible for creating tasks and managing the communication between application stages. The run-time system determines the number of tasks executed concurrently and the number of threads used to process each task. This parallelism configuration is represented by a <outer, inner> parallelism tuple. While these values may be modified dynamically, the threads used in the actual computation are reused out of a pool of threads to minimize overheads. We employed OpenMP 3.1 for the thread implementation due to its known efficiency.

The steps of a task execution in our run-time are presented in Figure 4. Once a task is submitted, it is inserted into a Task Queue (Step 1). If there are worker threads available, they are woken up and retrieve a task from the Queue (Steps 2 and 3). For each function within the task, the outer thread configures the number of inner threads used to compute it (Steps 4-8).

Figure 4: Adaptive Task-Based Run-time Engine

The system executes tasks in the same order in which they are registered. Once the execution of a task has finished, the outer thread tries to retrieve another task from the Queue.

The three main steps of the Index Search stage of our PQANNS algorithm are implemented as tasks as shown in Listing 1. The compute_distances, sum_distances, and kmin variables are pointers to the functions that implement the search steps. Functions annotated with PARALLEL can use multiple inner threads, whereas those annotated with SEQUENTIAL execute using a single thread. In our example, the first two functions run in parallel. The last step is inexpensive compared to the others and does not benefit from parallelism. Our approaches to set up the <outer, inner> parallelism tuple and the task granularity to minimize query response times are presented in the next section.

auto runtime = std::make_shared<Runtime>(numOuters, numInners);

auto task = std::make_unique<Task>();
task->setTaskType(CPUTASK);
task->addFunction(compute_distances, PARALLEL);
task->addFunction(sum_distances, PARALLEL);
task->addFunction(kmin, SEQUENTIAL);
task->addInputParam(&input_query_objects);
task->addOutputParam(&nn_objects);

runtime->submitTask(std::move(task));

Listing 1: Run-time API and the creation of an example task with multiple steps (functions).
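For illustration only (this is not the authors' run-time), an <outer, inner> tuple can be realized with nested OpenMP parallelism, matching the OpenMP 3.1 mechanism mentioned above; process_tasks and its body are our assumptions:

#include <omp.h>

// 'outer' worker threads each take one task; each task runs its PARALLEL
// functions with 'inner' threads, so outer * inner matches the core count.
void process_tasks(int outer, int inner, int numTasks) {
    omp_set_nested(1);  // enable nested parallel regions (OpenMP 3.1 style)
    #pragma omp parallel for num_threads(outer) schedule(dynamic)
    for (int t = 0; t < numTasks; ++t) {
        #pragma omp parallel num_threads(inner)
        {
            // e.g., compute_distances / sum_distances for task t,
            // split across the 'inner' threads of this region
        }
    }
}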

5. Parallelism and Task Granularity Auto-Tuning

Our application domain presents interesting challenges, as the execution time of each of the application's internal tasks (queries) affects the experience of the user with the service. Therefore, tuning the application or run-time system to maximize throughput will not minimize the user-observed response times in several scenarios. For instance, when the load on the system (number of incoming queries) is low, it may be more efficient to employ a large number of computing resources (i.e., CPU cores) to process a single query, whereas the configuration that leads to the highest throughput would employ a single CPU core per query. This trade-off is discussed and experimentally evaluated in Section 6.3. In addition to the parallelism, modifying the granularity of the tasks may also reduce the per-task overheads and, as a result, improve the application efficiency. However, the granularity, like the parallelism, impacts the response time and should be set carefully according to the observed load.

In order to address the aforementioned challenges, in this section we propose an algorithm for tuning the parallelism and task granularity during the execution with the goal of minimizing query response time (Algorithm 4). This algorithm (ADAPT) continuously measures the Task Queue Size (TQS), which varies with the load of the application, to decide which parallelism configuration to use. The main goal of ADAPT is to minimize the query response time = queued time + processing time. When the load is higher than the application throughput under a given parallelism configuration, TQS grows and the queued time dominates the query response time. In this case, ADAPT increases the outer parallelism to raise the system throughput and reduce TQS. In scenarios of low load, TQS is small; thus, the queued time is insignificant, and increasing the inner parallelism helps to minimize the processing time and, as a consequence, the overall query response time.

Algorithm 4 ADAPT
  CC = Total Computing Cores;
  Iteration = 0;
  while newNumOuters < CC do
      newNumOuters = 2^Iteration;
      if newNumOuters ≥ TQS then
          break;
      end
      Iteration++;
  end
  newNumInners = CC / newNumOuters;

Further, the ADAPT+G algorithm has been created in order to also consider the task granularity. This algorithm extends ADAPT to increase/decrease the task granularity according to the system load. ADAPT+G works by increasing the task granularity if TQS is growing and the outer parallelism is already at its maximum (equal to the number of cores or threads). In other words, if the outer parallelism cannot be increased and the Queue is still growing, increasing the task granularity is the remaining option to improve the throughput and reduce the queued time. However, if the TQS starts to decrease and the granularity is greater than one query per task, the granularity is reduced. In all other cases, ADAPT+G executes like ADAPT. In all cases, we restrict the algorithms' decisions to configurations in which outer × inner equals the number of computing cores. ADAPT and ADAPT+G are executed in Step 4 of our run-time system (Figure 4). Before reconfiguring the system (Step 5), the outer thread waits for all the other outer threads (workers) to finish the execution of their current function call. When all workers are done, the new parallelism configuration is used.
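Since pseudocode is given only for ADAPT, the following is our hedged sketch of the ADAPT+G policy exactly as described above (the function and variable names are ours):

// Queue growing while outer parallelism is maxed out -> coarsen tasks;
// queue draining -> refine back toward one query per task; otherwise
// fall back to the ADAPT rule (smallest power-of-two outer >= TQS).
void adapt_plus_g(int CC, int TQS, int lastTQS,
                  int& numOuters, int& numInners, int& granularity) {
    if (numOuters == CC && TQS > lastTQS) {
        granularity++;                       // only remaining way to raise throughput
    } else if (TQS < lastTQS && granularity > 1) {
        granularity--;                       // load dropped: favor per-query latency
    } else {
        numOuters = 1;                       // ADAPT: grow outer parallelism with TQS
        while (numOuters < CC && numOuters < TQS) numOuters *= 2;
        numInners = CC / numOuters;          // keep outer * inner == CC
    }
}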

6. Evaluation

This section describes the computing environments and datasets used in our evaluation, compares our approach based on Product Quantization to a state-of-the-art indexing method (FLANN), evaluates the strategies to tune the application, and performs large-scale experiments to analyze the scalability of our solution.

6.1. Experimental Setup

The experimental evaluation was performed using two computing environments. The first one is a local machine used to execute small-scale experiments. This machine is equipped with an 8-core Intel Xeon E5-2690 and 16 GB of RAM, and it was employed in the experiments reported in Sections 6.2, 6.3, 6.4, and 6.5. The second environment is a distributed-memory machine with 128 nodes interconnected through an FDR Infiniband switch. Each computing node has a dual-socket Intel Haswell E5-2695 v3 CPU and 128 GB of RAM, and all machines run Linux. This setup was used in the large-scale evaluations reported in Section 6.6. The algorithm has been executed using Intel MPI version 3.1, but it can also be linked to other implementations such as OpenMPI.

We employed three datasets of local 128-dimensional SIFT descriptors [38], each with three subsets: learning, database, and query. The database subsets have 1M (million), 10M, and 256B (billion) SIFT descriptors, the latter extracted from 256 million images collected from the web. The first two datasets were previously used in [10], whereas the third was generated in this work. Although SIFT descriptors have been employed in our evaluation, we want to highlight that approaches based on global descriptors, including VLAD (vector of locally aggregated descriptors) [39] and Fisher Kernels [40], would also benefit from our efficient indexing system, as demonstrated in prior work [39].

To evaluate the quality of the results returned by the methods, we use recall@R: the proportion of query vectors for which the actual nearest neighbor is ranked in the first R positions of the returned results. For instance, if R=1, it checks whether the first value returned by the method is the actual nearest neighbor of the query. This metric corresponds to the precision as used in other works [5].

6.2. Comparison to the State of the Art: FLANN

The approach for approximate nearest neighbor search available in FLANN [8, 15] is known to be efficient. As discussed, it employs hierarchical structures, i.e., kd-trees and k-means trees, and can select the most efficient algorithm out of the available choices. It also tunes the algorithm's parameters in a training phase to attain the best performance. An essential difference between FLANN and IVFADC/PQANNS is that the former maintains all vectors (descriptors) in RAM, because FLANN executes a re-ranking phase that computes the actual distance between the query vector and the candidate nearest neighbors. In PQANNS, on the other hand, only quantized values are kept in memory, which significantly reduces the memory demands.

The evaluation performed here presents results for 1-recall@1, i.e., the average proportion of queries whose actual nearest neighbor is returned in the first position (the precision) [5]. Although only R=1 is used, the behavior of the methods is similar for larger values of R. The experiments were executed on our local machine equipped with an 8-core Intel Xeon E5-2690, and both FLANN and IVFADC indexed the dataset with 1M vectors and were searched with 10K query vectors. The methods have been tuned to compare the search times for different result qualities. The FLANN parameters are chosen automatically by the tool, whereas for IVFADC the w and the number of coarse centroids (w/# of centroids) were varied. For the sake of comparison, the same experiment has also been executed with the exact algorithm that computes the k-NN using the Yael [41] library. Yael is an optimized library that implements the exact algorithm with functions from BLAS/LAPACK, which are well known to be efficient. For reference, in that case, the experiment took about 212 seconds.

The experimental results comparing IVFADC/PQANNS and FLANN as the precision is varied are presented in Figure 5. As may be noticed, PQANNS is more efficient in nearly all cases. Moreover, PQANNS attains this performance using only about 25 MB of RAM, while FLANN requires more than 600 MB of RAM. Compared to the exact search, both approximate methods attain significant performance improvements.

6.3. The Parallelism Configuration Impact on Throughput and Response Time

This section presents the impact of the application parallelism configuration (<outer, inner>) on the application throughput (queries/second) and query response time. These experiments employed the dataset with 10M descriptors and 100K query vectors, and the PQANNS has been set up with eight quantizers and 2^8 coarse centroids. The throughput results are presented in Table 2, and the experiments were executed on the local machine equipped with an 8-core Intel Xeon E5-2690.

Figure 5: IVFADC vs FLANN: trade-offs between search quality (precision) and search time using the 8-core Intel Xeon E5-2690 machine.

As expected, the best performance for this metric is attained with the highest outer parallelism value, which leads to low thread synchronization overheads.

Table 2: Application throughput as the parallelism configuration (<outer, inner>) is varied using the 8-core Intel Xeon E5-2690 machine.

Parallelism <outer, inner>   Throughput (queries/s)
<8,1>                        80.2
<4,2>                        71.6
<2,4>                        64.6
<1,8>                        59.1

The evaluation of the query response time for each parallelism configuration as the application load factor varies is presented in Figure 6. In this experiment, the query rates vary during the execution of each test following a Poisson distribution with mean equal to load factor × maximum throughput (80.2). This experiment evaluates the scenario of an actual online multimedia service, in which we expect to experience different loads over a single run. The results show that the parallelism configuration that minimizes the response time varies according to the load factor. This indicates that multiple settings should be used over the execution to reduce the response time as the load changes.
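A minimal sketch of such a load generator (our illustration of the setup described above, not the authors' benchmark code): each second, the number of issued queries is drawn from a Poisson distribution whose mean is the load factor times the peak throughput.

#include <random>
#include <vector>

// Number of queries to issue in each of 'seconds' one-second steps,
// with mean arrival rate loadFactor * 80.2 queries/s (the Table 2 peak).
std::vector<int> make_arrivals(double loadFactor, int seconds, unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::poisson_distribution<int> arrivals(loadFactor * 80.2);
    std::vector<int> perSecond(seconds);
    for (int s = 0; s < seconds; ++s) perSecond[s] = arrivals(gen);
    return perSecond;
}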

Figure 6: Response time as the query rate (load factor) varies within each single execution following a Poisson distribution using the 8-core Intel Xeon E5-2690 machine.

6.4. The Effect of Task Granularity on Throughput and Response Time

The task granularity is defined in our application as the number of queries bundled for execution as a single task. In this section, we evaluate the impact of changing the granularity on the application efficiency and query response time. Table 3 presents query execution times as the granularity is increased using the 8-core Intel Xeon E5-2690 machine. As expected, the execution time grows with the granularity, but it increases at a lower rate. For instance, in the experiment using one thread, the granularity is increased 10× while the execution time grows only about 8×. Thus, a higher task granularity implies better use of the resources and a higher throughput, at the cost of increasing the query response time. Because the query loads change during the execution of online multimedia services, similarly to the parallelism configuration case, the task granularity needs to be modified on-the-fly to minimize the query response time.

Table 3: Average query execution time in seconds as the task granularity (number of queries per task) increases using the 8-core Intel Xeon E5-2690 machine.

Task Granularity   Threads=1   Threads=2   Threads=4   Threads=8
1                  0.0999      0.0558      0.0309      0.0159
2                  0.1634      0.0916      0.0507      0.0273
3                  0.2482      0.1396      0.0775      0.0370
4                  0.3330      0.1809      0.1003      0.0469
5                  0.4200      0.2261      0.1269      0.0599
10                 0.8050      0.4161      0.2258      0.1092
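To make the throughput/latency trade-off concrete, reading the single-thread column as per-task execution times (consistent with the roughly 8× growth noted above): at granularity 1, a task of one query takes 0.0999 s, i.e., about 10 queries/s, whereas at granularity 10 a task of 10 queries takes 0.805 s, i.e., about 12.4 queries/s. The grouping therefore buys roughly 24% more throughput at the cost of an 8× longer completion time for the queries bundled in that task.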

6.5. The Performance of the Strategies to Tune Parallelism and Task Granularity

In this section, we evaluate the performance of our proposed algorithms that tune the application parallelism and task granularity at run-time, aiming to minimize the query response time. As previously discussed, the input query rate (load factor) varies during the execution of an online service according to the users' demands. Thus, a single parallelism or granularity setting is not able to minimize the response time under fluctuating loads. The evaluation of the adaptive tuning is carried out using input query loads that vary with a Poisson distribution, as presented in the previous section.

The performance of the application on the 8-core Intel Xeon E5-2690 machine for various static configurations and for our proposed dynamic algorithms is shown in Figure 7. The first curves present the performance of the algorithm for different static parallelism configurations with a task granularity of one query. As before, none of the static settings was able to perform well for all load factors. Further, we present our dynamic algorithms: "ADAPT + 1" (ADAPT), which adapts the parallelism configuration using a task granularity of one query, and ADAPT+G, which, in addition to the parallelism configuration, tunes the task granularity.

Figure 7: Evaluation of the dynamic parallelism and granularity tuning using the 8-core Intel Xeon E5-2690 machine.

As shown, the adaptive strategies were able to change the parallelism, keeping the average response time near or below that of the best static configurations in nearly all cases. For load factors smaller than 0.6, ADAPT and ADAPT+G presented results close to the best static configuration (<1,8>), which demonstrates that even under low loads, when the configuration with the highest inner parallelism is expected to perform best, our adaptive strategies are competitive. However, when the query load factor increases, the adaptive approaches strongly outperform the best static configuration.

For instance, for a load factor of 0.8, ADAPT+G leads to a response time about 2.43× smaller than that observed with the best static parallelism configuration, and this improvement reaches 6.4× for a load factor of 1. Also, the gains of the adaptive task granularity tuning of ADAPT+G on top of ADAPT are only significant, as expected, for high loads. For a load factor of 0.9, ADAPT+G delivers response times 2.8× lower than those of ADAPT.

6.6. Scalability Evaluation

This section evaluates the scalability of our parallel distributed memory version of PQANNS. This experiment has been executed using our largest dataset, with 256B SIFT vectors. The algorithm uses a training dataset containing 50M descriptors, and 10K query vectors. The PQANNS was configured to use w = 4 and 8,192 coarse centroids, which results in a precision of about 80%. This experiment has been carried out as a weak-scaling evaluation in which the dataset and the number of nodes are increased at the same rate. As such, 2B SIFT vectors are stored per compute node, and 256B vectors are used in the experiment that employs 128 machines. This setting is more appropriate than the typical strong-scaling one in this application domain, because the indexing is expected to handle massive and increasing datasets.

Figure 8: Scalability of the distributed memory parallelization of the PQANNS in a weak-scaling experiment using a dataset with 256B SIFT descriptors in the configuration with 128 nodes.

The execution times of the search as the dataset size and the number of nodes increase are presented in Figure 8. As presented, the application scaled very well and attained an efficiency of about 0.97 (97%) with 128 nodes as compared to the baseline execution using a single machine. The communication network traffic of the system as the number of nodes changes is presented in Figure 9. As shown, the traffic in the network is very low regardless of the number of computing nodes used, which is another promising aspect of the solution. This indicates that the algorithm would remain scalable if a much larger number of nodes were used.
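For reference, weak-scaling parallel efficiency is typically computed as E(N) = T(1)/T(N), where T(N) is the average search time on N nodes with a proportionally larger dataset; under this reading, E(128) ≈ 0.97 means the per-query search time grew by only about 3% while the dataset grew 128×.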

Figure 9: Network traffic (MB/s) as the number of computing nodes is varied in the weak-scaling experiment with the 256B SIFT descriptor dataset (configurations with up to 128 nodes).

7. Conclusions and Future Work

This work addresses the challenges of computing nearest neighbor searches in large-scale datasets for online multimedia services. The requirements of this class of applications include the need to index large datasets, which are queried by users in an online scenario in which minimizing the execution time of each query is crucial to improve the user's experience. These demands are aggravated by the variable nature of the query rates observed by these applications.

Our approach to address these challenges includes the design and implementation of a distributed memory parallel PQANNS search engine. This parallelization efficiently exploits inter-/intra-node parallelism with a dataflow decomposition for distributed memory execution, which is coupled with a task-based run-time system to adequately use the multicore CPUs in each node of the machine. Furthermore, to react to variations in the application load, the task-based run-time system implements new algorithms that tune the parallelism and task granularity on-the-fly to minimize query response times.

We have experimentally evaluated our propositions under different scenarios. First, we compared PQANNS to a state-of-the-art indexing solution (FLANN), and PQANNS has shown to attain better precision vs. response time trade-offs. Further, we executed the application in configurations with varying loads to analyze the ability of the proposed tuning approaches to minimize the query response time. This experimental evaluation showed that ADAPT and ADAPT+G achieve similar or superior results as compared to the best static parallelism in all experiments. In cases with high loads, ADAPT+G could reduce the average query response times by about 2.43× as compared to the best static configuration. These performance gains are higher with load factors of 0.9 and 1.0, for which the average query response times were reduced by 3.5× and 6.4×, respectively. We have also evaluated our distributed memory parallelization of PQANNS on a cluster with 128 nodes, where it scaled very well (with an efficiency of about 0.97) and handled a dataset with 256B SIFT descriptors.

In our future work, we will evaluate the use of GPUs to accelerate the PQANNS in distributed memory settings under varying workloads. We will also evaluate the performance of other indexing data structures to replace the inverted list used in the original PQANNS. Additionally, a limitation of our work is that the designed parallel algorithm/implementation is not able to handle machine failures; this is another interesting direction for future work.

Acknowledgement

This work was partially funded by Fapemig, CNPq, CAPES, and by projects InWeb (MCT/CNPq 573871/2008-6), MASWeb (FAPEMIG-PRONEX APQ-01400-14), Capes/Brazil grant PROCAD183794 and EUBra-BIGSEA (H2020-EU.2.1.1 690116, Brazil/MCTI/RNP GA-000650/04).

References

670

[1] C. B¨ ohm, S. Berchtold, D. A. Keim, Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases, ACM Comput. Surv. 33 (2001) 322–373. doi:http://doi.acm.org/10.1145/502807.502809. [2] H. Jegou, L. Amsaleg, C. Schmid, P. Gros, Query adaptative locality sensitive hashing, in: 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2008, pp. 825–828. doi:10.1109/ICASSP.2008.4517737.


[3] G. Teodoro, E. Valle, N. Mariano, R. Torres, W. Meira, Jr., J. Saltz, Approximate similarity search for online multimedia services on distributed CPU-GPU platforms, The VLDB Journal (2013) 1–22. doi:10.1007/s00778-013-0329-7.
[4] J. H. Friedman, J. L. Bentley, R. A. Finkel, An Algorithm for Finding Best Matches in Logarithmic Expected Time, ACM Trans. Math. Softw. 3 (1977) 209–226. doi:10.1145/355744.355745.


[5] M. Muja, D. G. Lowe, Fast approximate nearest neighbors with automatic algorithm configuration, in: VISAPP International Conference on Computer Vision Theory and Applications, 2009, pp. 331–340.
[6] A. Beygelzimer, S. Kakade, J. Langford, Cover Trees for Nearest Neighbor, in: Proceedings of the 23rd International Conference on Machine Learning, ICML '06, ACM, New York, NY, USA, 2006, pp. 97–104. doi:10.1145/1143844.1143857.
[7] R. Weber, H.-J. Schek, S. Blott, A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces, in: VLDB, 1998, pp. 194–205.


[8] M. Muja, D. G. Lowe, Fast matching of binary features, in: Proceedings of the 2012 Ninth Conference on Computer and Robot Vision, CRV '12, 2012, pp. 404–410. doi:10.1109/CRV.2012.60.
[9] A. Gionis, P. Indyk, R. Motwani, Similarity Search in High Dimensions via Hashing, in: Proceedings of the 25th International Conference on Very Large Data Bases, VLDB '99, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1999, pp. 518–529. URL http://dl.acm.org/citation.cfm?id=645925.671516


[10] H. Jégou, M. Douze, C. Schmid, Product quantization for nearest neighbor search, IEEE Transactions on Pattern Analysis and Machine Intelligence. doi:10.1109/TPAMI.2010.57.
[11] G. Andrade, G. Teodoro, R. Ferreira, Online Multimedia Similarity Search with Response Time-Aware Parallelism and Task Granularity Auto-Tuning, in: 2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 2017, pp. 153–160. doi:10.1109/SBAC-PAD.2017.27.


[12] G. T. Gudmundsson, L. Amsaleg, B. T. Jónsson, M. J. Franklin, Towards Engineering a Web-Scale Multimedia Service: A Case Study Using Spark, in: Proceedings of the 8th ACM on Multimedia Systems Conference, MMSys'17, ACM, New York, NY, USA, 2017, pp. 1–12. doi:10.1145/3083187.3083200.
[13] E. Valle, M. Cord, S. Philipp-Foliguet, High-dimensional descriptor indexing for large multimedia databases, in: Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM, 2008, pp. 739–748. doi:10.1145/1458082.1458181.


[14] P. Ram, D. Lee, H. Ouyang, A. G. Gray, Rank-Approximate Nearest Neighbor Search: Retaining Meaning and Speed in High Dimensions, in: Advances in Neural Information Processing Systems (NIPS) 22 (Dec 2009), MIT Press, 2010.
[15] M. Muja, D. G. Lowe, Scalable nearest neighbor algorithms for high dimensional data, IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (11) (2014) 2227–2240. doi:10.1109/TPAMI.2014.2321376.


[16] C. Silpa-Anan, R. Hartley, Optimised KD-trees for fast image descriptor matching, in: IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8. doi:10.1109/CVPR.2008.4587638.
[17] D. Nister, H. Stewenius, Scalable Recognition with a Vocabulary Tree, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR, 2006, pp. 2161–2168. doi:10.1109/CVPR.2006.264.
[18] A. Stupar, S. Michel, R. Schenkel, RankReduce - processing K-Nearest Neighbor queries on top of MapReduce, in: LSDS-IR, 2010.


[19] B. Bahmani, A. Goel, R. Shinde, Efficient distributed locality sensitive hashing, in: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM ’12, ACM, 2012, pp. 2174–2178. doi:10.1145/2396761.2398596.


[20] J. Pan, D. Manocha, Fast GPU-based Locality Sensitive Hashing for K-nearest Neighbor Computation, in: Proceedings of the 19th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, GIS '11, ACM, New York, NY, USA, 2011, pp. 211–220. doi:10.1145/2093973.2094002.
[21] M. Kruliš, T. Skopal, J. Lokoč, C. Beecks, Combining CPU and GPU architectures for fast similarity search, Distributed and Parallel Databases 30 (2012) 179–207. doi:10.1007/s10619-012-7092-4.


[22] G. Teodoro, E. Valle, N. Mariano, R. Torres, W. Meira, Jr., Adaptive parallel approximate similarity search for responsive multimedia retrieval, in: Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM '11, 2011. doi:10.1145/2063576.2063651.
[23] X. Yang, Y. Hu, A landmark-based index architecture for general similarity search in peer-to-peer networks, in: IEEE International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007, pp. 1–10. doi:10.1109/IPDPS.2007.370230.
[24] L. Cayton, Accelerating Nearest Neighbor Search on Manycore Systems, in: International Parallel and Distributed Processing Symposium (IPDPS 2012), 2012, pp. 402–413. doi:10.1109/IPDPS.2012.45.


[25] The Message Passing Interface (MPI). URL http://www-unix.mcs.anl.gov/mpi/
[26] J. Johnson, M. Douze, H. Jégou, Billion-scale similarity search with GPUs, arXiv preprint arXiv:1702.08734, 2017.


[27] A. Wakatani, A. Murakami, GPGPU implementation of nearest neighbor search with product quantization, in: 2014 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA), IEEE, 2014, pp. 248–253. doi:10.1109/ISPA.2014.42.


[28] D. Moise, D. Shestakov, G. Gudmundsson, L. Amsaleg, Indexing and Searching 100M Images with Map-reduce, in: Proceedings of the 3rd ACM Conference on International Conference on Multimedia Retrieval, ICMR '13, ACM, New York, NY, USA, 2013, pp. 17–24. doi:10.1145/2461466.2461470.


[29] F. Blagojevic, D. S. Nikolopoulos, A. Stamatakis, C. D. Antonopoulos, M. Curtis-Maury, Runtime scheduling of dynamic parallelism on accelerator-based multi-core systems, Parallel Comput. 33 (2007) 700–719.


[30] M. Curtis-Maury, J. Dzierwa, C. D. Antonopoulos, D. S. Nikolopoulos, Online Power-performance Adaptation of Multithreaded Programs Using Hardware Event-based Prediction, in: Proceedings of the 20th Annual International Conference on Supercomputing, ICS '06, ACM, New York, NY, USA, 2006, pp. 157–166. doi:10.1145/1183401.1183426.
[31] Y. Ding, M. Kandemir, P. Raghavan, M. J. Irwin, Adapting Application Execution in CMPs Using Helper Threads, J. Parallel Distrib. Comput. 69 (9) (2009) 790–806. doi:10.1016/j.jpdc.2009.04.004.


[32] Z. Wang, M. F. O’Boyle, Mapping Parallelism to Multi-cores: A Machine Learning Based Approach, SIGPLAN Not. 44 (4) (2009) 75–84. URL http://doi.acm.org/10.1145/1594835.1504189


[33] A. Raman, H. Kim, T. Oh, J. W. Lee, D. I. August, Parallelism Orchestration Using DoPE: The Degree of Parallelism Executive, in: Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '11, ACM, New York, NY, USA, 2011, pp. 26–37. doi:10.1145/1993498.1993502.
[34] R. H. Arpaci-Dusseau, E. Anderson, N. Treuhaft, D. E. Culler, J. M. Hellerstein, D. A. Patterson, K. Yelick, Cluster I/O with River: Making the Fast Case Common, in: Input/Output for Parallel and Distributed Systems, 1999.


[35] M. D. Beynon, T. Kurc, U. Catalyurek, C. Chang, A. Sussman, J. Saltz, Distributed processing of very large datasets with DataCutter, Parallel Comput. 27 (11) (2001) 1457–1478. doi:10.1016/S0167-8191(01)00099-0.
[36] G. Teodoro, D. Fireman, D. Guedes, W. Meira, Jr., R. Ferreira, Achieving Multi-Level Parallelism in the Filter-Labeled Stream Programming Model, in: International Conference on Parallel Processing, 2008, pp. 287–294. doi:10.1109/ICPP.2008.72.


[37] R. Ferreira, W. Meira, Jr., D. Guedes, L. Drummond, B. Coutinho, G. Teodoro, T. Tavares, R. Araujo, G. Ferreira, Anthill: a scalable run-time environment for data mining applications, in: Symposium on Computer Architecture and High-Performance Computing (SBAC-PAD), 2005.
[38] D. G. Lowe, Distinctive Image Features from Scale-Invariant Keypoints, Int. J. Comput. Vision 60 (2) (2004) 91–110. doi:10.1023/B:VISI.0000029664.99615.94.
[39] H. Jégou, F. Perronnin, M. Douze, J. Sanchez, P. Perez, C. Schmid, Aggregating Local Image Descriptors into Compact Codes, IEEE Trans. Pattern Anal. Mach. Intell. 34 (9) (2012) 1704–1716. doi:10.1109/TPAMI.2011.235.


[40] T. S. Jaakkola, D. Haussler, Exploiting Generative Models in Discriminative Classifiers, in: Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems (NIPS), MIT Press, Cambridge, MA, USA, 1999, pp. 487–493. URL http://dl.acm.org/citation.cfm?id=340534.340715
[41] M. Douze, H. Jégou, The Yael Library, in: Proceedings of the 22nd ACM International Conference on Multimedia, MM '14, ACM, New York, NY, USA, 2014, pp. 687–690. doi:10.1145/2647868.2654892.


Author Biographies



Guilherme Andrade holds a degree in Computer Science from the Federal University of São João del Rei (2012) and a Master's degree in Computer Science from the Federal University of Minas Gerais (2014). He is currently a PhD candidate in the Department of Computer Science (DCC) at the Federal University of Minas Gerais, where he works on high-performance computing for heterogeneous architectures.



André Fernandes is an undergraduate student in the Department of Computer Science at the Universidade de Brasília, Brazil. His main research interest is large-scale similarity search applications in distributed environments.



Jeremias Gomes holds a degree in Computer Science from the Centro Universitário de Brasília and a Master’s degree in Computer Science from Universidade de Brasília, Brazil. He is currently a PhD candidate in Computer Science at the Universidade de Brasília, working on high-performance computing applied to biomedical informatics.



Renato Ferreira is an associate professor in the Department of Computer Science at the Universidade Federal de Minas Gerais. His research focuses on compiler and run-time support for high-performance computing over large, dynamic datasets. This work combines high performance, an important concern from the applications' end users' perspective, with high-level programming abstractions, which are important for application domain developers.

George Teodoro received his M.S. and Ph.D. degrees in Computer Science from the Universidade Federal de Minas Gerais (UFMG), Brazil, in 2006 and 2010, respectively. He is currently an assistant professor in the Computer Science Department at the University of Brasília (UnB), Brazil. His primary areas of expertise include high-performance runtime systems for the efficient execution of biomedical and data-mining applications on distributed heterogeneous environments.