Simulation Modelling Practice and Theory 93 (2019) 148–163
Hybrid capacity planning methodology for web search engines

Mauricio Marin⁎,a,d, Veronica Gil-Costa⁎⁎,c, Carolina Bonacic a, Alonso Inostrosa-Psijas a,b

a DIINF, CITIAPS, University of Santiago, Chile
b Universidad Arturo Prat, Av. Arturo Prat 2120, Iquique, 1100000, Chile
c CONICET, National University of San Luis, Argentina
d CeBiB, DIINF, University of Santiago, Chile
ARTICLE INFO

Keywords: Capacity planning; Large scale systems; Web search engines

ABSTRACT
Capacity planning studies are suitable for supporting decision making in the management and operation of Web search engines deployed on large clusters of processors. Among many possibilities, they make it possible to ensure that a sufficient amount of computational resources is provisioned in time to efficiently deal with the ever-changing streams of user queries. In this paper, we present a simulation-based methodology devised to perform capacity planning in large-scale Web search engines. It combines classical operational analysis formulae with discrete event simulation to significantly reduce the number of deployments that must be evaluated to find an optimal assignment for a target workload. We experimentally evaluate our proposal for demanding cases such as service nodes with temporary failures. The results show that the proposed methodology is able to produce good quality results in practical running times.
1. Introduction

A capacity planning study is a complex and time-consuming task when the collection of services forming a large-scale Web search engine is deployed and executed on commodity clusters of processors. Services are divided into partition and replica nodes to achieve the goal of keeping query response times and computational resource utilization below predefined upper bounds, and keeping the rate of fully processed queries per unit time (throughput) at the same level as the user query arrival rate. Each configuration of services can be represented as an n-dimensional tuple where each dimension represents the number of replicas or partitions for each service. Selecting the appropriate configuration then involves evaluating the performance of the system over all possible configurations, in a search for the configuration that requires the smallest number of processors while satisfying a given target query throughput as well as the bounds on query response time and utilization.

In this paper we propose a capacity planning methodology to determine the amount of computational resources required by Web search engines and their deployment on clusters of processors to operate efficiently. We refer to large and complex search engines composed of distributed single-purpose services engineered to support widely different computational requirements (usually based on one partition/replica service node per dedicated processor), where each processor is expected to be a multi-core system enabling multi-threading on shared memory data structures, and where message passing is applied among processors to enable parallel computation on the distributed memory.
⁎ Corresponding author.
⁎⁎ Corresponding author.
E-mail addresses: [email protected] (M. Marin), [email protected] (V. Gil-Costa).
https://doi.org/10.1016/j.simpat.2018.09.016 Received 13 June 2018; Received in revised form 27 September 2018; Accepted 29 September 2018 Available online 02 October 2018 1569-190X/ © 2018 Elsevier B.V. All rights reserved.
In addition, a number of strategies for caching, indexing, routing and other related query processing strategies and heuristics are properly combined in each service to increase performance and efficiently cope with sudden changes in the incoming user query traffic. Modeling the cost of each of those components is a challenging problem in its own right and, from this baseline, determining the amount of computational resources required by each service and how to distribute them on the underlying processors exacerbates the complexity of capacity planning. Therefore, it is relevant for engineers in data centers and for designers of new query processing strategies to have tools for predicting the performance of these systems and determining the computational resources they require to be deployed in production. Given a target incoming query traffic, the goal is to determine the minimum number of partitions and replicas for each service and how to distribute them on the processors. In this context, capacity planning becomes a key tool to find a suitable services configuration during studies such as the design, sizing and updating of the software and hardware infrastructure required by the search engine to operate efficiently in production.

The methodology proposed in this paper combines simulation with operational analysis formulae. To simplify modeling, the key features in the cost of search engine operations are identified and integrated into cost functions devised to determine the average cost of services. Parameters such as average service times and average query answer cache hits are estimated from benchmark programs using representative query logs for a wide range of services configurations. In our methodology, we first execute a search algorithm where the operational analysis formulae, parameterized with the results obtained from light simulations, are used to quickly discard those configurations that will not satisfy the constraints imposed for the target performance goal. Then, we employ a simulator able to predict performance in a precise manner to simulate the services of the Web search engine running on the small set of configurations not discarded in the previous step. The aim is to determine the configuration reporting the highest performance and the smallest number of replicas and partitions. Finally, we use a graph partitioning heuristic to distribute the service replicas and service partitions on the processors so that communication and load balance are optimized.

The remainder of this paper is organized as follows. In Section 2 we describe the architecture of Web search engines. In Section 3 we discuss previous related work. In Section 4 we present our proposed hybrid methodology for capacity planning of Web search engines. Section 5 details our experimental setup and evaluation. Conclusions follow in Section 6.

2. Web search engines

Large-scale Web search engines (WSE) are commonly composed of a collection of services. Services are devised to quickly process user queries in an on-line manner. In general, each service is devoted to a single operation within the query process. Multi-threading is used to exploit multi-core processing on data stored in the processors holding the WSE services.

A query submitted by a user goes through different stages. In Fig. 1, the Front-End Service (FS) is composed of several processing nodes or processors. Each FS node receives user queries (step 1 in Fig. 1) and sends back the top-k results to related systems (step 6) such as machine learning re-rankers, query answer builder servers and Ad servers. Note that these related systems are expected to work only with a very small constant-size subset of the Web documents that contain the query terms and can thereby potentially be part of the query answer. We refer to these systems as users. Each FS node tracks each received query, which is linked to a timestamp value to ensure a valid response to the user.

After a query arrives at an FS node, referred to as FS node fi below, a caching service (CS) node is selected to determine whether the query has been previously processed (step 2) and contains a timestamp value within a time-to-live range that makes it a valid query answer for user delivery. The goal of the CS is to maintain a set of previously computed query results in a limited-capacity cache memory. To balance the workload among the CS nodes and to support fault tolerance, a consistent hash-based ring can be used to distribute the CS nodes on partition processors [1].

Fig. 1. Web search engine: Three main services (FS, CS and IS) and the steps executed to process queries. The IS keeps an intersection cache and the inverted index in main memory.
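The partition selection just described can be illustrated with a small consistent-hashing sketch. The Python code below is a didactic analogue, not the engine's implementation; the class name HashRing and the vnodes parameter are hypothetical.

```python
import hashlib
from bisect import bisect_right

class HashRing:
    """Each CS partition is placed at several points on a hash ring; a query
    is routed to the first partition clockwise from the hash of its terms,
    so the same query always reaches the same partition."""
    def __init__(self, partitions, vnodes=64):
        points = [(self._hash(f"{p}#{v}"), p)
                  for p in partitions for v in range(vnodes)]
        points.sort()
        self.ring = points
        self.keys = [h for h, _ in points]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def partition_for(self, query_terms):
        # Sorting the terms makes the routing independent of term order.
        h = self._hash(" ".join(sorted(query_terms)))
        i = bisect_right(self.keys, h) % len(self.keys)
        return self.ring[i][1]

ring = HashRing(partitions=[f"CS{p}" for p in range(8)])
print(ring.partition_for(["web", "search", "engine"]))
```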
The CS is deployed in a cluster forming an array of P × R processors or caching service nodes. A simple LRU strategy can be used to achieve a reasonably high hit rate. The memory cache partitioning is performed by means of a distributed memory object caching system like Memcached [2], where a given query is always assigned to the same CS partition. Memcached uses a hash function with uniform distribution over the query terms to determine the partition that holds the entry for the query. To increase throughput and to support fault tolerance, each partition is replicated R times. Therefore, at any given time, different queries associated with the same partition can be solved by different replicas of that partition.

If the query is cached, the CS node sends the top-k document IDs to the FS node (step 3). Afterwards, the FS node fi sends the query results to users (step 6). If the query is not found in the cache, the CS node sends a hit-miss message to the FS node fi (step 3). At this point, the FS node sends an index search request to the index service (IS) cluster (step 4). The IS computes, by means of a ranking function, the k documents that are most similar to the query and sends them to the requesting FS node fi (step 5).

The IS contains an index built from a large set of Web documents that allows fast mapping between query terms and documents. The index is used to speed up the selection of the documents containing the query terms and contains pre-computed data to reduce the cost of the ranking function. The number of documents and the index are usually huge and thereby must be evenly distributed onto a large set of processors in a share-nothing fashion. The index is kept compressed in the distributed main memory held by all the processors. Thus, for the IS setting, the standard cluster architecture is an array of P × R processors or index search nodes, where P indicates the level of document collection partitioning and R the level of document collection replication. The rationale for this 2D array is as follows: each query is sent to all the P partitions, in each partition one replica is selected and, in parallel, the P local top-k document ID sets are determined. These local top-k results are then merged by the FS node fi to determine the global top-k document IDs.

The index stored in each IS node is the so-called inverted index [3–8] (or inverted file), a data structure used by well-known WSEs. It is composed of a vocabulary table (which contains the V distinct relevant terms found in the document collection) and a set of posting lists. The posting list for term c ∈ V stores the identifiers of the documents that contain the term c, along with additional data used for ranking purposes. To solve a query, one must fetch the posting lists for the query terms, compute the intersection among them, and then compute the ranking of the resulting intersection set using algorithms like WAND [9]. These algorithms use dynamic pruning techniques to avoid processing complete lists. In particular, WAND [9] and its variant BM-WAND [10] are based on two levels of processing. In the first level, some potential documents are selected as results using an approximate evaluation. Then, in the second level, those potential documents are fully evaluated (e.g. using BM25 or the vector model) to obtain their scores. A heap keeps the current top-k documents, with the lowest-scoring document at the root. The root score provides a threshold value which is used to decide whether to fully evaluate the remaining documents in the posting lists associated with the query terms. This scheme allows skipping many documents that would have been evaluated by an exhaustive algorithm. Therefore, the cost of the query ranking process is not linear with respect to the length of the associated posting lists and depends on the specific query terms. Thus, the IS solves a query by fetching the posting lists for the query terms, computing the intersection among them, and then computing the ranking of the resulting intersection set with a ranking algorithm (the WAND operation performs both operations at the same time, but there are alternative ranking functions that require a separate intersection).
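The two-level pruning described above can be sketched as follows. This is a simplified illustration, not the WAND algorithm of [9]: it assumes per-term score upper bounds are available and iterates over candidate documents instead of traversing posting lists with skips; full_score stands for a BM25-style evaluation.

```python
import heapq

def wand_top_k(candidates, upper_bounds, full_score, k=10):
    heap = []  # min-heap of (score, doc_id); the root holds the k-th best score
    for doc_id, terms in candidates:
        threshold = heap[0][0] if len(heap) == k else 0.0
        # First level: approximate evaluation with per-term score upper bounds.
        if sum(upper_bounds[t] for t in terms) <= threshold:
            continue  # pruned: this document cannot enter the top-k
        # Second level: full (expensive) evaluation, e.g. BM25.
        s = full_score(doc_id, terms)
        if len(heap) < k:
            heapq.heappush(heap, (s, doc_id))
        elif s > heap[0][0]:
            heapq.heapreplace(heap, (s, doc_id))
    return sorted(heap, reverse=True)

# Toy usage with two terms and a synthetic scoring function.
ub = {"web": 2.0, "search": 1.5}
docs = [(d, ["web", "search"]) for d in range(100)]
print(wand_top_k(docs, ub, lambda d, t: (d % 7) + 0.1, k=3))
```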
Each IS node sends the local top-k document IDs to an FS node (step 5 in Fig. 1), and the FS node merges all of the results coming from the IS nodes. The IS can also keep an intersection cache memory to store precomputed, frequently used posting list intersections from previous queries.

Services are deployed on large clusters of commodity processors (usually one replica/partition per processor) that communicate through a fast network of switches connected in a Fat-Tree scheme [11]. A Fat-Tree is organized in three levels: at the bottom, services are connected to edge switches; in the middle are the aggregation switches; and at the top are the so-called core switches. This kind of network provides fault tolerance and a high level of parallelism in message traffic, avoiding congestion since messages can follow different paths in the Fat-Tree to reach the same destination.

3. Previous work

There are several performance models for capacity planning of different systems [12–15]. The Amazon company keeps its capacity planning methodologies private; very little is known about them beyond the fact that machine learning techniques are used to predict the growth of the needs of their data centers [16]. Amazon Elastic Compute Cloud (EC2) is devised for cloud computing services and gives the user the possibility to select the instances on which to deploy the application. Amazon Auto-Scaling is a service that allows the user to define several instances according to ranges of workloads. E.g., the user can define that for a given average load in the range [0%, 50%] he/she requires X instances, and for the range (50%, 100%] he/she requires X + 10 instances. Similarly, Google uses machine learning to build capacity plans focused on optimizing the use of energy (Power Usage Effectiveness, PUE). In [17], Gao presents an artificial intelligence algorithm (which, by default, generates a black-box model without knowledge of the relationships between the elements of the system) trained with the historical usage data of mechanical and electrical equipment. In [15], Kejariwal and Allspaw describe the capacity planning process using Flickr as an example. They define five steps: (1) setting the goals; (2) using metrics to find the limits of the hardware for supporting bursts of requests (e.g., with the Ganglia software); (3) predicting the trends of the metrics; (4) deploying and managing the new resources acquired according to the prediction calculated in the previous steps; and (5) managing the capacity of the resources via autoscaling.
Netflix uses the Scryer [40] autoscaling engine, which works with the historical data of users (such as usage patterns), linear regression and the fast Fourier transform to estimate the number of resources. All calculations are based on the workload.
The work in [18] presented a performance model for searching large text databases by considering several parallel hardware architectures and search algorithms. It examines three different document representations by means of simulation and explores response times for different workloads. The authors in [19] presented an INQUERY-based simulation model for multi-threaded distributed information retrieval (IR) systems. The work in [20] presented a framework based upon queuing network theory for analyzing search systems in terms of operational requirements such as response time, throughput, and workload. The proposed model uses synthetic workloads. A key feature in our context is to properly consider user behavior by using actual query logs that represent it. Notice that in practical studies one is interested in investigating system performance using an existing query log, previously collected from the queries issued by the actual users of the search engine. Moreover, Chowdhury and Pass [20] assume perfect balance between the service times of the nodes and do not verify the accuracy of their model against actual experimental results.

Cacheda et al. in [21], and later in [22], simulated different architectures of a distributed information retrieval system. Through this study it is possible to approximate an optimal architecture. However, they assume that service times are balanced when the nodes handle similar amounts of data. This work was extended in [23] to study the interconnection network of a distributed information retrieval system, and later in [24] to estimate the communication overhead. In [25], Jiang et al. presented algorithms for capacity analysis of general services in on-line distributed systems. The work in [26] presented the design of simulation models to evaluate configurations of processors in an academic environment. The work presented in [27] proposed a mathematical algorithm to minimize the resource cost for a server cluster. In our application domain, mathematical models are not capable of capturing the dynamics of user behavior nor temporally biased queries on specific query topics. The work in [27] was extended in [28] to include mechanisms which are resilient to failures.

The work presented in [29] and continued in [30] characterizes the workload of search engines and uses the approximate MVA algorithm [13,31]. However, this proposal is evaluated on a very small set of IS nodes, with only eight processors. The effects of asynchronous multi-threading are not considered, as they assume that each processor serves load using a single thread. This work also does not consider the effects caused on the distribution of inter-arrival times when queries arrive at more than one FS node. They also use the harmonic number to compute the average query residence time at the index service (IS) cluster. This is applicable to WSE systems in which every index partition Pi delivers its partial results to a manager processor (the front service node in our architecture) and stays blocked until all P partitions finish the current query. This is an unrealistic assumption since current systems are implemented using asynchronous multi-threading in each processor/node. In the system used in our work, the flow of queries is not interrupted: each time an IS partition finishes processing a query, it immediately starts the next one. In [32], Gil-Costa et al.
simulated the performance of vertical search engines through the simulation of a hierarchical timed CPN model under some query traffic. The outcome of the simulation is the average query response time and the workload of processors. The proposed model can be adapted to deal with technical issues like the underlying communication network between CS, IS and FS nodes and the possible consequences on performance. However, this work has some limitations regarding the number of CS partitions (it cannot be parameterized), and the time required to run the simulations grows exponentially with the number of services.

The limitations of previous attempts to model Web search engines show that this problem is not simple to solve. Our proposal resorts to a methodology that is more complex but more powerful in terms of its ability to predict the performance of these complex systems.

4. Hybrid capacity planning methodology

4.1. Proposal

As explained above, a Web search engine is composed of different services such as the front-service (FS), cache-service (CS) and index-service (IS). The running time cost of each service is dominated by a few primitive operations executed to calculate the top-k documents for each user query. We propose to exploit this feature to develop a capacity planning methodology for Web search engines. We model the actual hardware and system software by considering only the relevant features that affect the overall cost of computations. To describe our proposal (w.l.o.g.), we specifically focus on the search engine architecture presented in Fig. 1. The ideas presented can be extended to other architectures requiring partition and/or replication of additional services on clusters of processors.

Our goal is to quickly find safe services configurations as defined in the following. A services configuration is represented as a tuple < FSr, ISp, ISr, CSp, CSr > where FSr is the number of FS replicas, ISr represents the number of IS replicas and ISp the number of IS partitions, and CSr and CSp are the number of replicas and partitions for the CS respectively. A safe services configuration is defined as the minimum amount of FSr, CSp, CSr, ISp and ISr required to support a given incoming query traffic provided it satisfies the following constraints: (i) individual query response times must be kept below an upper bound R′ to prevent users from experiencing long waiting times; (ii) processor utilizations have to be below an upper bound (between 40% and 80% at most) to support sudden rises in the incoming query traffic; and (iii) query throughput (i.e., fully solved queries per unit time) must be kept at the same value as the incoming query arrival rate. Each constraint represents a different dimension of the capacity planning problem, as shown in Fig. 2. A minimal sketch of this representation is given below.

The approach proposed in this paper uses discrete event simulation, which provides a way to overcome the difficulties of getting too deep into hardware details [19,21,22]. It is more realistic but more expensive in terms of development and production of results [13]. Each simulation for large systems can take several minutes to complete, and the execution time of each simulation tends to grow as we increase the number of services and hardware components [33]. Our methodology combines simulation and operational analysis for open systems (OAOS) formulae to significantly reduce the total number of simulations.
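As a minimal sketch (with illustrative names and thresholds, not the paper's tooling), a services configuration and the safety test can be encoded as:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    FSr: int   # FS replicas
    ISp: int   # IS partitions
    ISr: int   # IS replicas per partition
    CSp: int   # CS partitions
    CSr: int   # CS replicas per partition

    @property
    def processors(self) -> int:
        # One service node per dedicated processor, as assumed in the text.
        return self.FSr + self.ISp * self.ISr + self.CSp * self.CSr

def is_safe(response_time, utilization, throughput, arrival_rate,
            R_bound, u_high=0.8):
    """Constraints (i)-(iii): bounded response time, bounded utilization,
    throughput matching the query arrival rate."""
    return (response_time <= R_bound
            and utilization <= u_high
            and throughput >= arrival_rate)
```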
Fig. 2. Dimensions of the problem: The utilization level of each service is kept below 40% to support unexpected rises in query traffic, where the query response time is bounded by R′ provided that the query throughput is kept at the same level as the query arrival rate.
In Fig. 3, we show the steps involved in the proposed methodology. On the left side, we have all possible configurations for our application domain; that is, each black point represents a tuple with a different number of FS replicas, CS replicas and partitions, and IS replicas and partitions. The first step aims to reduce the number of possible configurations by feeding the OAOS formulae with the results of light simulations of each individual service of the Web search engine. For each service (FS, CS and IS), only the main operations at a single processor are simulated. Thus the light simulations are simpler and faster than the complex ones executed in the next steps of the methodology because they avoid simulating time-consuming aspects such as the communication network, the multi-core processor architecture, message passing among services and multi-threading within services. Query contents, overall response times of individual components (e.g., the processing time for a specific document ranking algorithm), and query inter-arrival times are obtained from traces of actual executions (query logs) and benchmark programs written to determine the cost of relevant operations.

In this way, our simulation-based methodology is able to capture (i) the unpredictable behavior of users, by simulating with input data obtained from real query logs that reflect the characteristics and behavior of users; (ii) temporally biased queries on specific topics, also by means of the user query logs; (iii) the imbalance of query processing across different services, which is related to the length of the distributed posting lists and the operation of the document ranking algorithm; (iv) the cost of software and hardware, measured through benchmark programs previously executed on samples of the actual data collection; and (v) the competition for accessing both the system software and the hardware resources required to process the queries, captured by process-oriented discrete event simulation as described below.

The light simulations of individual single-processor services are used to calculate metrics such as the service time, utilization level, inter-arrival rate of queries, and routing probabilities for the OAOS formulae. Then, we execute a search algorithm that implements the OAOS formulae – described in Section 4.3 – to discard all of the services configurations that do not qualify as safe configurations (step 2 of Fig. 3).
Fig. 3. The proposed methodology combines discrete event simulation and classical OAOS formulae. Step (1) reduces the space of all possible services configurations by combining light simulations (isolated single-processor services) and OAOS formulae. Step (2) selects a safe configuration and expands the search space around this configuration. Step (3) executes fully complex simulations, including the network communication, multithreading, message passing and so on, and all involved services, to select a safe services configuration. Step (4) applies a graph-partition-based allocation algorithm for the deployment of replicas and partitions of services on the cluster processors.
In the third step of the methodology, we select the safe services configuration with the smallest number of processors. We expand the configuration space around the selected safe configuration in a range of [−5%, +5%]. These configurations are then evaluated through full complex simulations (including the Fat-Tree [11] communication network, multithreading, message passing among services, etc.). Afterwards, we again select the configuration reporting the smallest number of processors. Finally, in the fourth step, we apply a graph-partition allocation algorithm (cf. [34]) to define the final deployment of the services on the number of processors calculated in the previous step (this step can be embedded in the simulations, wherein usually a simple allocation heuristic works well).

4.2. Simulation model

The simulation model uses a processes-and-resources approach. Processes represent threads in charge of processing high cost operations executed in the Web search engine. Resources are shared artifacts such as posting lists, data structures for partial results, global variables, RAM memory, cores and processor caches, and communication interfaces and switches for the Fat-Tree network.

Our simulator programs are implemented on top of the LibCppSim [35] library. This library manages the creation/removal of co-routines as well as the future event list. The library ensures that the simulation kernel grants execution control to co-routines in a mode of one co-routine at a time. Co-routines are activated following the sequential occurrence of events in chronological order. Co-routines represent processes that can be blocked and unblocked at will during simulation by using the operations passivate(), hold() and activate(). When a hold(Δt) operation is executed, the co-routine is paused for Δt units of simulation time representing the dominant cost of a task. Once the simulation time Δt has expired, the co-routine is activated by the simulation kernel. The dominant costs come from tasks related to the ranking of documents, the intersection of posting lists and the merging of partial results. These costs are determined by benchmark programs implementing the same operations executed on single processors. Additionally, a co-routine executes a passivate() operation to stop itself, indicating it has paused its work. Finally, a co-routine in the passivate state can be activated by another co-routine using the activate() operation.

Fig. 4 shows the simulation of a few example steps executed by an index service (IS) node to perform a document ranking operation. On the left, the figure shows the operations executed by the IS node. The co-routine simulates a lock operation, which is used to prevent simultaneous accesses to the posting lists associated with the query terms. It then simulates a document ranking operation for the query terms and finally it simulates the respective unlock operation. On the right, the figure shows the cost in simulation time of each operation. Competition for computational resources is simulated by further expanding the cost functions associated with each operation. The simulation of the lock and unlock operations requires the application of the passivate() and activate() operations on a list of co-routine pointers to provide exclusive access to the posting list.
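The co-routine mechanics just described can be mimicked with Python generators. The sketch below is a toy analogue of the LibCppSim primitives (hold, passivate, activate), not the C++ API; event ordering is handled with a heap-based future event list.

```python
import heapq

class Kernel:
    """Toy simulation kernel: a clock plus a future event list."""
    def __init__(self):
        self.clock, self.fel, self.seq = 0.0, [], 0

    def activate(self, proc, delay=0.0):
        self.seq += 1  # tie-breaker: generators are not comparable
        heapq.heappush(self.fel, (self.clock + delay, self.seq, proc))

    def run(self):
        while self.fel:
            self.clock, _, proc = heapq.heappop(self.fel)
            try:
                op, *args = next(proc)       # resume the co-routine
            except StopIteration:
                continue
            if op == "hold":                 # hold(dt): consume dt of sim time
                self.activate(proc, args[0])
            # "passivate": stay blocked until another process calls activate()

def is_node(kernel, ranking_cost, n_queries):
    for q in range(n_queries):
        yield ("hold", ranking_cost)         # dominant cost: document ranking
        print(f"t={kernel.clock:.4f}: query {q} ranked")

k = Kernel()
k.activate(is_node(k, ranking_cost=0.002, n_queries=3))
k.run()   # prints events at t=0.0020, 0.0040, 0.0060
```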
The document ranking simulation considers the fact that posting lists are large sections of contiguous memory that are split into processor cache lines in order to travel from the main memory to the L1 cache associated with the core serving the respective thread. The transfer of a cache line from one memory to another in the processor takes a certain amount of simulation time to account for the respective latency. Typically, the implementation of search engines enforces the scheduling policy of one single thread per core to prevent saturation at the processor level. Thus the incoming queries are queued up at the assigned thread to receive service, which is also reflected in the corresponding simulation. Queries compete for access to the threads, and through them they get access to processor resources. The sequences of operations across multiple services and processors form independent directed acyclic graphs (DAGs) for each query. In the corresponding simulation these DAGs are explicitly represented by event messages containing the respective query IDs so that they can be forked and joined to mimic the respective edges and vertices. The edges cause query processing delays in simulation time that depend on the co-occurring edges in the processors. The vertices cause message transfer delays that depend on the messages circulating in the Fat-Tree communication network [11].
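A minimal sketch of the fork/join bookkeeping for these per-query DAGs is shown below; the class and method names are illustrative, assuming one partial-result message per IS partition.

```python
from collections import defaultdict

class JoinTable:
    """Tracks the join side of a per-query DAG: one pending partial
    result per IS partition, keyed by the query ID carried in events."""
    def __init__(self, partitions):
        self.P = partitions
        self.pending = defaultdict(int)

    def fork(self, query_id):
        self.pending[query_id] = self.P      # fork: one edge per partition

    def on_partial_result(self, query_id):
        self.pending[query_id] -= 1
        if self.pending[query_id] == 0:      # join complete: global merge
            del self.pending[query_id]
            return True
        return False

jt = JoinTable(partitions=4)
jt.fork(17)
print([jt.on_partial_result(17) for _ in range(4)])  # [False, False, False, True]
```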
Fig. 4. An example of a co-routine executing the operations of an index service.
Search_Safe_Configuration( MIN_ISp, MAX_ISp, MAX_CSp )
 1: for ( i = MIN_ISp; i ≤ MAX_ISp; i = i + 1 ) do
 2:   ts_IS ← IS_service_time( i )
 3:   hit_IS ← Intersection_hit_ratio( i )
 4:   for ( j = 1; j ≤ MAX_CSp; j = j + 1 ) do
 5:     ts_CS ← CS_service_time( j )
 6:     hit_CS ← hit_ratio( j )
 7:     ts_FS ← FS_service_time( i )
 8:     R_IS ← min_max_utilization( ts_IS, hit_IS )
 9:     R_CS ← min_max_utilization( ts_CS, hit_CS )
10:     R_FS ← min_max_utilization( ts_FS )
11:     param = ( ts_IS, hit_IS, ts_CS, hit_CS, ts_FS )
12:     for ( k = R_FS.min; k ≤ R_FS.max; k = k + 1 ) do
13:       for ( m = R_IS.min; m ≤ R_IS.max; m = m + 1 ) do
14:         for ( p = R_CS.min; p ≤ R_CS.max; p = p + 1 ) do
15:           if ( Eval_Performance( param, k, i, m, j, p ) ≤ R′ ) then
16:             SafeSet ← AddConfiguration( k, i, m, j, p )
17:           end if
18:         end for
19:       end for
20:     end for
21:   end for
22: end for
Algorithm 1. Algorithm used to estimate the maximum and minimum number of replicas and partitions for each service, provided that the constraints for processor utilization, query throughput and individual query response time are satisfied.
To simulate the Fat-Tree network, messages are divided into packages of fixed size. Each package includes the data, a header with sender and receiver identifiers, and the number of packages forming the message. All input packages go to the same input queue. We keep an output queue (or output port) for each device connected to the network switch. Packages forming a message can be sent in parallel through different output queues. Benchmark programs are devised to evaluate the cost of different communication patterns such as multicast, broadcast, and point-to-point messages.

4.3. Safe services configuration search algorithm

In Algorithm 1 we present our search algorithm which, for a given incoming query traffic, receives three parameters called MIN_ISp, MAX_ISp and MAX_CSp. The value of MAX_ISp is the maximum number of IS partitions and MAX_CSp is the maximum number of CS partitions. They represent saturation points: beyond these maxima it is not possible to improve the performance of the system. In other words, it is not possible to reduce the query processing time in each index service node partition nor to increase the number of cache hits in the CS. These values are obtained by running benchmark programs and light simulations of the IS and CS single-processor services in isolation. MIN_ISp is a parameter defining the minimum number of index node partitions required to hold the entire inverted index in main memory (current Web search engines keep all data structures in main memory to avoid delays from secondary memory).

For each number of IS partitions (line 1) and each number of CS partitions (line 4), the algorithm computes the upper and lower bounds for the number of replicas of each service. First, for a given number of IS partitions i, we set the service time of the IS in line 2 and the hit ratio of the intersection cache in line 3. These values are obtained through the execution of light simulations. In the same way, we obtain the service time and hit ratio for j CS partitions in lines 5 and 6. Finally, in line 7 we obtain the service time for the FS, which takes into account the interaction with the other services. In Section 5 we show that the service time of the FS depends on the number of IS and CS nodes, and therefore isolating this service can reduce the precision of the results obtained by the search algorithm.

The min_max_utilization() function (lines 8–10) applies operational analysis over each service to determine the minimum (R_FS.min) and maximum (R_FS.max) number of replicas that satisfy the constraint of keeping the query throughput at the given query arrival rate. We want to obtain the minimum number of replicas for which the utilization is kept below 80% and the maximum number of replicas for which the utilization is above 20%. In case the function estimates the minimum number of replicas for a service as 1, we increase this value to support fault tolerance. The Eval_Performance function, in line 15, uses the OAOS formulae to determine the query response times for each services configuration. These formulae are described in the next section. If the estimated query response time is below the upper bound R′, the respective services configuration is safe. Finally, all services configurations evaluated as safe are stored in the SafeSet container (line 16). A runnable skeleton of this search loop is sketched below.
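The following Python skeleton mirrors Algorithm 1. The light_sim lookups, min_max_utilization and eval_performance callables are assumed to be provided (a sketch of the latter follows Eq. (7) in Section 4.4); all names are illustrative.

```python
from collections import namedtuple
from itertools import product

Range = namedtuple("Range", "min max")  # bounds returned by min_max_utilization

def search_safe_configurations(min_isp, max_isp, max_csp, R_bound,
                               light_sim, min_max_utilization,
                               eval_performance):
    safe_set = []
    for i in range(min_isp, max_isp + 1):                 # IS partitions
        ts_is, hit_is = light_sim.is_time(i), light_sim.inter_hit(i)
        for j in range(1, max_csp + 1):                   # CS partitions
            ts_cs, hit_cs = light_sim.cs_time(j), light_sim.hit_ratio(j)
            ts_fs = light_sim.fs_time(i)
            r_is = min_max_utilization(ts_is, hit_is)     # Range(min, max)
            r_cs = min_max_utilization(ts_cs, hit_cs)
            r_fs = min_max_utilization(ts_fs, None)
            params = (ts_is, hit_is, ts_cs, hit_cs, ts_fs)
            for k, m, p in product(range(r_fs.min, r_fs.max + 1),
                                   range(r_is.min, r_is.max + 1),
                                   range(r_cs.min, r_cs.max + 1)):
                if eval_performance(params, k, i, m, j, p) <= R_bound:
                    safe_set.append((k, i, m, j, p))      # <FSr, ISp, ISr, CSp, CSr>
    return safe_set
```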
Fig. 5. Queuing network for a Web search engine with three services (FS, CS and IS), where λ = X0 is the query arrival rate and hit is the probability of a cache hit for queries in the CS.
From the set of safe services configurations not discarded by the algorithm, we select the one with the smallest number of processors. As explained above, we expand the configuration space around the selected safe configuration in a range of [−5%, +5%]. These configurations are then evaluated through full complex simulations to find the best one in terms of the smallest total number of processors satisfying the constraints for processor utilization, query throughput and individual query response time.

4.4. Performance model

Our capacity planning methodology relies on a queuing-based analytical model to estimate the average query response time. We assume a system where the numbers of replicas and partitions are sized to obtain a query throughput rate (X0) similar to the query arrival rate (λ), namely X0 = λ. In Fig. 5, queries arrive at the FS following a negative exponential distribution. The FS nodes process queries with an average service time ts_FS (average service time per query, taken as a function of the number of IS partitions, since each query is always sent to a single CS partition). Then, they send those queries to the CS. The service time within the CS is also a constant value ts_CS because a constant-time data structure is used to hold the previously computed query answers. Every CS node checks whether each query was previously solved and whether its results are stored in the cache memory. Regardless of the CS result (hit or miss), queries are sent back to the FS nodes. Queries reporting a cache hit are delivered to users. The remaining X0·(1 − hit) queries are sent to the IS, where hit is the cache hit probability. Thus, for the target throughput X0, overall X0·(1 − hit) queries per unit time are solved by the IS and X0·hit queries per unit time are solved by the CS.

We aim to keep the utilization level U of each node below 80% to support peaks in query traffic and above 20% to reduce the number of idle processors. Using operational analysis formulae, we compute the utilization U = (λ·S)/(c·R), where R is the number of replicas and c is the number of threads per processor [36]. Therefore, c·R is the total number of services in the system, S is the average service time and λ is the query arrival rate. The query arrival rate is computed for each service as follows. From Fig. 5, the total rate of incoming queries λFS for the FS can be calculated as λFS = X0 + X0 + X0·(1 − hit) = 3·X0 − X0·hit. Then, given a service time SFS = ts_FS, we obtain UFS = (λFS·ts_FS)/(c·RFS), where RFS is the number of replicas for the FS, which is calculated using Eq. (1). For the IS and CS we apply the same formula using λIS = X0·(1 − hit), S = ts_IS and λCS = X0, S = ts_CS respectively. Then, we solve Eq. (1) to obtain the number of replicas for the lowest utilization (Ul = 20%) and for the highest utilization (Uh = 80%). For each service this provides a range in which to search for safe services configurations. For instance, if we want to estimate the maximum number of replicas for the FS (R_FS.max in line 10 of Algorithm 1), we set Ul = 20% in Eq. (1). If we want the minimum number of replicas R_FS.min, we apply the highest utilization (Uh = 80%). The min_max_utilization() function of Algorithm 1 (lines 8–10) uses Eq. (1) to compute the maximum and minimum number of replicas per service:

R_j = \frac{\lambda_j \cdot ts_j}{c \cdot U}    (1)
Once we compute the interval borders R_j.max and R_j.min for a service j ∈ {FS, CS, IS}, we calculate the set of numbers of replicas R_j.x (defined in Eq. (1)) with x in the range [R_j.min, R_j.max], and for each x we calculate the utilization Uj with the following formula:

m = \frac{1/U_l - 1/U_h}{R_{max} - R_{min}}, \qquad U = \frac{1}{1/U_h + m \cdot (R_x - R_{min})}    (2)

The value of the slope m is used to re-compute the utilization according to the current number of replicas R_j.x in the interval [R_j.min, R_j.max]. We assume that the utilization decreases linearly with the number of replicas, as shown in Fig. 6. Due to the fact that all partitions in each service perform the same operations, the average effect of the number of partitions in these formulae is introduced by the service time S = ts_j.
Fig. 6. Computing the utilization for Rj replicas. The y-axis shows the low (Ul) and high (Uh) utilization levels. The x-axis shows the number of replicas.
The query response time T = W + ts_j is estimated as the average waiting time W plus the average service time ts_j, using the Eval_Performance() function of Algorithm 1 (line 15). An approximation for W(G/G/c) is [13]:

W(G/G/c) \approx \frac{ErlangC(\lambda)}{(R_j \cdot (1 - U_j))/ts_j} \cdot \frac{cv_a^2 + cv_s^2}{2}    (3)
where cv_a^2 and cv_s^2 represent the squared coefficients of variation of the inter-arrival and service times respectively, R_j is the number of replicas for service j and U_j is the utilization computed with Eq. (2). There are alternative approximations for W(G/G/c) in the technical literature, like [37,38], but the approximation of Eq. (3) is simple and adequate for our capacity planning problem. The values for cv_a^2 and cv_s^2 are adjusted by running small simulations of each service in isolation.

Given RFS front service replicas and the utilization UFS previously obtained with Eq. (2), we estimate the query response time of the FS for a continuously incoming stream of queries using Eq. (4). This service receives queries from users, queries from the CS and (1.0 − hit) queries from the IS. Then, for each incoming query, we use the service time ts_FS obtained from benchmark programs and the average waiting time obtained from Eq. (3).

t_{FS} = (2.0 + (1.0 - hit)) \cdot ts_{FS} + \frac{ErlangC(\lambda)}{(R_{FS} \cdot (1.0 - U_{FS}))/ts_{FS}} \cdot \frac{cv_a^2 + cv_s^2}{2.0}    (4)
Eq. (5) presents the query response time estimated for the CS. This service receives only the queries arriving from users. The service time ts_CS is obtained from benchmark programs, and RCS replicas are used:

t_{CS} = ts_{CS} + \frac{ErlangC(\lambda)}{(R_{CS} \cdot (1.0 - U_{CS}))/ts_{CS}} \cdot \frac{cv_a^2 + cv_s^2}{2.0}    (5)
Eq. (6) presents the query response time for the IS. This service receives the (1.0 − hit) queries from the FS (queries reporting a cache miss in the CS). The service time is denoted by ts_IS and is obtained from benchmark programs. The value of RIS is the number of replicas for this service.

t_{IS} = (1.0 - hit) \cdot ts_{IS} + \frac{ErlangC(\lambda)}{(R_{IS} \cdot (1.0 - U_{IS}))/ts_{IS}} \cdot \frac{cv_a^2 + cv_s^2}{2.0}    (6)
Finally, the system query response time (QRT) is given by:

QRT = t_{FS} + t_{CS} + t_{IS}    (7)

The function Eval_Performance() applies the above formulae to return QRT.

5. Evaluation

5.1. Experimental setting
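Under the stated definitions, Eqs. (1)–(7) can be implemented directly. The sketch below uses the standard Erlang C waiting probability for the ErlangC(λ) term; treating each replica as one server is a simplifying assumption, and all function names are illustrative.

```python
from math import factorial

def erlang_c(lam, ts, servers):
    """Standard Erlang C waiting probability for `servers` servers with
    offered load a = lam * ts (used here for the ErlangC(lambda) term)."""
    a = lam * ts
    rho = a / servers
    if rho >= 1.0:
        return 1.0                      # saturated: every query waits
    top = a ** servers / factorial(servers) / (1.0 - rho)
    return top / (sum(a ** n / factorial(n) for n in range(servers)) + top)

def waiting_time(lam, ts, R, U, cva2, cvs2):
    # Eq. (3): W(G/G/c) ~ ErlangC(lam) / ((R (1 - U)) / ts) * (cva2 + cvs2) / 2
    return erlang_c(lam, ts, R) / ((R * (1.0 - U)) / ts) * (cva2 + cvs2) / 2.0

def eval_performance(X0, hit, ts_fs, ts_cs, ts_is,
                     R_fs, R_cs, R_is, U_fs, U_cs, U_is, cva2, cvs2):
    lam_fs = 3.0 * X0 - X0 * hit        # FS arrivals (Section 4.4)
    t_fs = (2.0 + (1.0 - hit)) * ts_fs \
        + waiting_time(lam_fs, ts_fs, R_fs, U_fs, cva2, cvs2)            # Eq. (4)
    t_cs = ts_cs + waiting_time(X0, ts_cs, R_cs, U_cs, cva2, cvs2)       # Eq. (5)
    t_is = (1.0 - hit) * ts_is \
        + waiting_time(X0 * (1.0 - hit), ts_is, R_is, U_is, cva2, cvs2)  # Eq. (6)
    return t_fs + t_cs + t_is           # Eq. (7): QRT
```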
The data presented in the experiments below consist of the 50.2 million document TREC ClueWeb09 corpus, category B (http://www.lemurproject.org/clueweb09.php/). We indexed this corpus using the Terrier IR platform (http://terrier.org/). We selected the first 20,000 queries from the TREC Million Query Track 2009. We pruned the index to keep only the data related to the terms of the query log. We executed queries against this index to get the top-k documents for each query.
Fig. 7. Cache hits obtained with different numbers of partitions in (a) the caching service and (b) the intersection cache of the index service.
We inserted these traces into the discrete-event simulators to evaluate our performance metrics. In the experiments, we show results normalized to 1 in order to better illustrate comparative performance: we divide all values by the observed maximum in each case. We ran a set of benchmark programs to obtain the service times, cache hits and coefficients of variation required by the OAOS formulae.

First, we evaluate the saturation points for the CS. In Fig. 7(a), the y-axis shows the hit ratio and the x-axis shows the number of Memcached partitions, from 1 to 200,000. Each partition keeps a cache memory ranging from 50K to 8000K query entries. Notice that a larger cache can keep more pre-computed top-k results for queries. With the largest cache size (8000K), the maximum hit ratio is reached with at least 400 CS partitions. On the other hand, when the CS nodes have a small cache size they perform more evictions in the cache; consequently, more than 400 partitions are required to achieve the highest hit ratio. Fig. 7(b) shows the hit ratio for the intersection cache in the IS nodes. We set the size of the inverted index to fit into 10 IS nodes (MIN_ISp = 10). As we increase the number of IS partitions, the index portion held by each node is smaller, and the remaining main memory is used to store the results of the intersections of the posting lists for the query terms. Thus, the intersection cache size depends on the memory available in each node. We assume that with 10 IS partitions there is no space in main memory for the intersection cache. Fig. 7(b) shows that the maximum hit ratio is obtained with close to 160 IS partitions.

We use the WAND [9] algorithm to perform document ranking in the IS, and we ran a WAND benchmark program to obtain the average running time required to solve a query in the IS. Query running times reported by the WAND algorithm can be divided into two components: (1) the time required to update the top-k heap, and (2) the time required to compute the similarity between the query and a document. The heap is used to maintain the local top-k document scores in each index service node. The similarity between the query q and a document d is given by a BM25-based score(d, q) function. In Fig. 8(a), the x-axis shows the query identifier and the y-axis shows the time required to update the heap and the time required to compute the similarity between the documents and a given query. The results show that the time required to compute the similarity dominates the time required to update the heap. Fig. 8(b), at left, shows the average number of heap updates and the average number of similarity computations per query.
Fig. 8. WAND operator: (a) time required to update the heap and to compute score(q, d); (b) number of heap updates and similarity computations, and their variance.
Fig. 9. Benchmarks on the FS with 30 to 80 nodes while varying the number of CS nodes: (a) inter-arrival time in seconds and (b) service times in seconds.
With P = 32 and top-10 document ID results, heap updates account for 0.1% of the total number of operations per query. This percentage increases to 1% with P = 256 because the posting lists are smaller, and the WAND algorithm can skip more similarity computations. For a larger top-k, the number of heap update operations increases. Also, the number of heap updates decreases linearly with the number of processors, while the number of similarity computations is likewise reduced by increasing the number of processors. The results of Fig. 8(b), at right, show the variance of the number of operations performed by the WAND algorithm. The variance decreases when we increase the number of IS partitions because the posting lists are smaller and the WAND algorithm tends to skip more documents in each partition. Thus, the IS nodes perform fewer comparisons and almost the same amount of work for a large number of partitions.

Fig. 9(a) shows the average inter-arrival time of messages at the FS nodes. These are messages containing newly arrived queries, cached results for previous queries from the CS, and local top-k results from the IS partitions. The x-axis shows the number of participating FS nodes, which ranges from 30 to 80 to guarantee a query throughput equal to X0 = 3000. The same criterion was used to determine the number of resources for the CS and IS. The y-axis shows the number of CS partitions and the z-axis shows the average inter-arrival time reported by the FS nodes. The number of IS partitions was kept constant at 20, with 130 replicas each. The inter-arrival time increases with more CS partitions because more cache hits are reported by the CS and therefore fewer queries are sent to the IS, which reduces the number of messages generated by the IS. Thus, the flow of processed queries sent from the IS to the FS is lower. Fig. 9(b) shows the average service time reported by the FS nodes for the same services configuration. The service time tends to decrease with more CS partitions, because more cache hits are reported and fewer merging operations upon the partial results from the IS are performed by the FS nodes.

Fig. 10 shows that the IS presents a similar behavior for the message inter-arrival time. This is because the CS has a direct impact on the stream of queries sent to the IS. The query service time for the IS depends on the number of partitions. The results presented below show that the service time cannot be further reduced beyond 140 partitions. Table 1 shows the behavior of the IS in more detail: it shows how the IS coefficients of variation behave as we change the number of FS replicas or the number of CS partitions. Column cva shows that the variance of inter-arrival times is greater with more CS partitions; namely, queries arrive over wider periods of time because the system reports more cache hits and therefore fewer queries are sent to the IS. On the other hand, cvs shows a slight (but not significant) decrease with more CS partitions. This small variation is related to the management of queued queries. As expected, with more IS partitions the value of cvs is reduced because each IS node holds a smaller portion of the inverted index. Finally, increasing the number of FS replicas does not affect the service time of the IS, but it increases the variance of inter-arrival times (although not significantly).

5.2. Validation

The results presented in this section were obtained on the HECToR cluster (http://www.hector.ac.uk) with 1856 nodes.
Fig. 10. Inter-arrival time at the IS with different numbers of FS and CS nodes.
Table 1
Squared coefficients of variation of the inter-arrival and service times reported by the IS.

FS   CSp   ISp   cva           cvs
30   1     10    0.000925045   0.0336025
30   5     10    0.00100738    0.0328163
30   10    10    0.00101472    0.0327771
30   1     50    0.000913593   0.00672008
30   5     50    0.00101595    0.0065623
30   10    50    0.00102267    0.00655443
60   1     50    0.00105952    0.00672008
60   5     50    0.00114248    0.0065623
60   10    50    0.00115146    0.00655443
Each node has two 12-core AMD Opteron 2.1 GHz Magny Cours processors sharing 16 GB of memory. Each 12-core socket is coupled with a Cray SeaStar2 routing and communications chip. The nodes are organized in 20 cabinets, with a total of 464 compute blades; each blade contains four computing nodes. The results are obtained with an MPI library based implementation of a Web search engine with the FS, the CS and the IS. We evaluate the performance in terms of throughput, query response time and services utilization. We executed 500,000 queries to evaluate our MPI implementation and our simulator.

As a validation experiment, and considering that the proposal in this paper depends critically on proper caching, we executed the different caching strategies discussed in [39] over our log. Fig. 11 shows that our results precisely mimic the performance figures obtained in [39]. This experiment shows that our query log was properly pre-processed and that our caching policies are correctly implemented and trained with the initial 60% of the queries. Proper tuning of the simulator cost parameters is also critical to the comparative evaluation of different services configurations. Fig. 12(a) shows how the IS service time decreases with more partitions. The results achieved with the simulator are very close to those reported by the real implementation of the search engine. With about 140 partitions, the IS cannot improve its performance any further because the inverted index portion per node is very small, and the intersection cache almost reaches its maximum hit score (i.e., it is not possible to further reduce the service time for individual queries). Fig. 12(b) shows the throughput achieved by both the real implementation and the simulator. The x-axis stands for different configurations < FS, CSp, CSr, ISp, ISr > of partitions and replicas assigned to each service (FS, CS and IS), ranging from 115 processors < 12, 3, 1, 10, 10 > to 240 processors < 40, 20, 1, 30, 6 >. The results show that the simulator is able to predict all points in the curves of both Fig. 12(a) and (b) with good precision.

5.3. Formulae assessment

In this section we evaluate the efficiency and effectiveness of our search algorithm for services configurations. First, we analyze the effectiveness of the algorithm when the partitions for the IS and the CS are kept fixed at MAX_ISp and MAX_CSp respectively, and we focus on the prediction of replicas for a given query arrival rate. Fig. 13(a) shows the average query response time reported by the search algorithm and by full simulations over safe services configurations found to satisfy different query arrival rates. In this experiment we see that by using the formulae we get results very close to those obtained with full simulations: the average error is 0.82% and the observed maximum error is 2%. In Fig. 13(b) we vary the number of CS and IS partitions. This figure shows two groups of query response times: the higher one corresponds to queries solved by the IS, and the lower group of values corresponds to queries found in the cache. The error reported by the algorithm is also very low, 2% at most.

In Table 2 we analyze the efficiency of the search algorithm. In particular, we evaluate the percentage of services configurations discarded by the algorithm.
Fig. 11. Validation against third-party results.
Fig. 12. Results achieved by an MPI implementation and its respective simulator: (a) service time achieved with different numbers of IS partitions and (b) query throughput.
Fig. 13. Full simulations vs. OAOS formulae. Estimating (a) the number of replicas and (b) the number of IS and CS partitions, where each x-axis point is the configuration found for a different query arrival rate.

Table 2
Percentage of services configurations not discarded by the proposed algorithm.

MAX_ISp   MAX_CSp   #Configurations   Safe conf.
160       400       80,468,548        0.06%
140       200       36,669,374        0.14%
120       100       16,564,800        0.32%
For large numbers of partitions for the CS and the IS, only 0.06% of all possible configurations qualify as safe services configurations. For instance, with 120 partitions for the IS and 100 partitions for the CS we have at least 16,564,800 services configurations; after running the search algorithm, only 0.32% of the configurations are evaluated as safe services configurations through the execution of full simulations. Overall, the number of services configurations not discarded by the algorithm is below 1% of the complete space of possible configurations.

As explained in Section 4.3, from the set of safe services configurations produced by the search algorithm evaluating the OAOS formulae, we select the one with the least number of service nodes, which is labeled Sim in Fig. 14. The z-axis shows the total number of service nodes (CS+IS+FS). After expanding the search space in a range of [−5%, +5%], we are able to refine the search and find a better configuration, labeled Optimal, namely the safe services configuration requiring the least total number of service nodes obtained through the full complex simulations executed in this reduced search space. In Fig. 14(a), the search algorithm estimates the best configuration (Sim) with 529 service nodes (< 12, 1, 7, 10, 51 >). However, if we select only 49 IS replicas instead of 51 and decrease the number of FS nodes by one, we reduce the total number of service nodes to 510 and we keep the utilization of each service node close to 40%. This second safe configuration is labeled Optimal. Therefore, performing a second search for safe configurations through full simulations helps us to find a better option than the one found with the OAOS formulae.
Fig. 14. The optimal services configuration for query arrival rates of (a) 2000 queries per second and (b) 3000 queries per second. The axes FS and IS show utilization values achieved by the respective services.
In Fig. 14(b), the best configuration point reported by the algorithm involves 788 service nodes (< 17, 1, 11, 10, 76 >). After executing the full complex simulations over a 5% range around that point, we find the optimal configuration with 748 service nodes by reducing the number of IS replicas to 72.

5.4. Fault-tolerance evaluation

In this section, we evaluate the behavior of the search engine, using a safe configuration selected with the proposed methodology, under situations in which service nodes fail temporarily. The objective is to evaluate how robust a given services configuration recommended by the methodology is to severe failures. In this case, service nodes leave service for a while and are then re-incorporated. To this end, we divide the total simulation time into 16 intervals of length Δ. In the time interval Δ5, we inject faults into each service. Then, in the time interval Δ9, the service nodes that are out of service are put back into production.

The rules to handle failures are as follows (a code sketch of these rules is given below). Upon a failure of an FS node, queries being processed or queued inside the node are lost. Query results sent from the CS or from the IS to the failed FS node are lost, as shown in step (1) of Fig. 15. Upon the failure of a CS node, the FS selects another replica (steps (2) and (3)). Queries inside the failed CS node are lost. If all replicas of a given partition CSi fail, the FS sends the query directly to the IS (steps (4) and (5)). If an IS node fails before the FS sends the query, then the FS selects another replica (steps (6) and (7)). If the failure of the IS node occurs after the FS sends the query, then the partial results computed by that IS node are lost. Namely, the FS node waits for the partial results of P index service nodes; when queries are timed out due to an IS node failure, the FS node merges the P − X partial results (X is the number of IS partitions that failed for a given query) and sends an approximate result to the user. The query time-out is set to the maximum query response time achieved with no failures.
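A compact sketch of the FS-side rules is given below; the helper names and the query/document attributes (cs_partition, score) are hypothetical.

```python
def pick_cs_replica(query, alive_replicas):
    """CS rules: try another live replica of the query's partition; if the
    whole partition is down, the query bypasses the cache (steps 4-5)."""
    replicas = alive_replicas(query.cs_partition)
    return ("cs", replicas[0]) if replicas else ("is", None)

def merge_partials(partials, P, k=10):
    """FS-side join with time-out: if only P - X partitions answered, the
    merged (approximate) top-k is built from whatever arrived."""
    docs = [d for part in partials for d in part]   # X = P - len(partials)
    return sorted(docs, key=lambda d: d.score, reverse=True)[:k]
```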
Fig. 15. Web search engine supporting failures.
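The approximate-answer rule can be sketched as follows, assuming each IS partition returns a (doc_id, score) list sorted by descending score; the helper name and result encoding are our own assumptions, not the engine's interface.

```python
import heapq
from itertools import islice

def merge_top_k(partials, k=10):
    # partials: the score-sorted result lists that actually arrived, i.e.
    # P - X lists when X index-service partitions failed for this query.
    # Each list holds (doc_id, score) pairs sorted by descending score.
    merged = heapq.merge(*partials, key=lambda hit: hit[1], reverse=True)
    return list(islice(merged, k))

# Example with P = 3 partitions, one of which failed before answering:
part1 = [("d3", 0.91), ("d7", 0.55)]
part2 = [("d1", 0.87), ("d9", 0.60)]
print(merge_top_k([part1, part2], k=3))
# -> [('d3', 0.91), ('d1', 0.87), ('d9', 0.60)]
```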
Fig. 16. (a) Average utilization and average query response time reported by the FS under failures. (b) Impact of FS failures on the utilization of the CS and the IS.
Fig. 17. Utilization level and average query response time reported by (a) the CS and (b) the IS under failures.
In Figs. 16 and 17, the x-axis corresponds to time intervals of fixed size Δ. The utilization is computed as the ratio between the busy time of the service nodes and the length of the time interval. We evaluated the services configuration <13, 1, 13, 10, 135>, which is a safe configuration satisfying the constraints listed in the previous sections for a given incoming query traffic. In Fig. 16(a), the FS reaches a maximum utilization of 100% with 12 failures in the time interval [Δ5, Δ9]. A utilization of 80% is reached with more than 4 FS failures. The FS tends to quickly re-establish a utilization level below 80% after the re-insertion of the failed nodes. On the other hand, query response time increases drastically with more than 6 failures (only half of the FS nodes remain active); it is re-established with fewer than 9 failures. Fig. 16(b) shows the impact of FS failures on the utilization level of the other services (CS, IS). Upon failures of FS nodes, the utilization reported by the other services decreases because more queries are delayed in the queues of the active FS nodes. After the re-insertion of the failed nodes, the utilization of the IS and CS tends to increase as they receive more queries.

In Fig. 17, we stress the services workload by increasing the number of failures within the same partition; that is, failures are injected into the nodes of a single CS or IS partition. In Fig. 17(a), the CS reports a utilization level higher than 80% with more than 8 failures. After Δ9, when the nodes are re-inserted, the utilization tends to be re-established quickly. Query response time increases drastically with more than 10 failures; moreover, with 12 failures this measure cannot be re-established within the simulation time. Fig. 17(b) shows that the failures do not impact the utilization level of the IS. This is because the utilization is averaged over all index service nodes, a total of 1350 index service nodes. However, with more than 114 failures the query response time tends to increase.

These results show that the proposed capacity planning methodology is able to determine services configurations that are fairly robust in the presence of demanding rates of service node failures. Most cases of severe performance degradation due to failures were observed for situations very unlikely to occur in practice.
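For reference, the utilization metric plotted in Figs. 16 and 17 can be computed as in the sketch below; the numbers in the example are illustrative only, not measurements from our experiments.

```python
def service_utilization(busy_times, delta):
    # Per-interval utilization of a service: each node's busy time divided
    # by the interval length delta, averaged over the nodes in production.
    return sum(bt / delta for bt in busy_times) / len(busy_times)

# Illustrative numbers: three nodes busy 3.2 s, 2.8 s and 3.6 s within an
# interval of delta = 4 s give the 80% average utilization threshold that
# signals saturation in Fig. 16.
print(service_utilization([3.2, 2.8, 3.6], delta=4.0))  # -> 0.8
```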
6. Conclusions

We have proposed a methodology for capacity planning which combines OAOS formulae with discrete-event simulation to determine efficient deployments of Web search engines on clusters of processors. Our proposal has been evaluated on a Web search engine composed of three services (FS, CS and IS) as a case study. Overall, the proposed methodology is a suitable solution to a complex combinatorial problem, namely determining the amount of computational resources assigned to each search engine service and their distribution on the cluster of processors. We have experimentally validated our discrete-event simulations against a real MPI implementation of the Web search engine; the results show that the simulations are able to predict performance with errors well below 5%. We have also evaluated the effectiveness of the proposed OAOS formulae in reducing the search space of possible configurations of services (replicas and partitions). The formulae are evaluated by means of a search algorithm proposed in the paper. The results show that the search algorithm significantly reduces the total number of full complex simulations that must be executed to determine an optimal services configuration for efficient search engine deployment: the configuration space is reduced to less than 1% of all the cases that would have to be considered without the algorithm. This ensures good quality results in practical times for data center engineers.

Acknowledgments

This research was supported by the supercomputing infrastructure of the NLHPC Chile, partially funded by Comisión Nacional de Investigación Científica y Tecnológica Basal funds FB0001, Fondef ID15I10560, and partially funded by PICT 2014-01146.