Highlights

- This paper presents a performance evaluation approach for lambda architecture based computing systems.
- The system is modeled by means of a multisolution approach.
- Our work aims at providing insights to designers.
A performance modeling framework for lambda architecture based applications

M. Gribaudo, Politecnico di Milano, Dip. di Elettronica e Informazione, Milano, Italy
M. Iacono*, Università della Campania "Luigi Vanvitelli", Dip. di Matematica e Fisica, Caserta, Italy
M. Kiran, Lawrence Berkeley National Laboratory, Berkeley, California, US
Abstract

The lambda architectural pattern makes it possible to overcome some limitations of data processing frameworks. It builds on the idea of running two different data processing streams on the same system: real time computing for fast data streams, and batch computing for massive workloads whose processing can be delayed. While these two modes are clearly not new, lambda architectures coordinate their execution to avoid interference. However, resource allocation over cloud infrastructures greatly impacts the overall performances and, importantly, costs. If performance could be modeled in advance, architects could make better judgments on the allocation of their resources and use their systems more efficiently. In this paper, we present a modeling approach, based on multiformalism and multisolution techniques, that provides a fast evaluation tool to support design choices about parameters and eventually lead to better architecture designs.

Keywords: modeling languages, lambda architectural pattern, performance evaluation, multiformalism modeling, multisolution methods, cloud, analytical approach
* Contact author. Email addresses: [email protected] (M. Gribaudo), [email protected] (M. Iacono), [email protected] (M. Kiran)
1. Introduction

The combined velocity and volume requirements of Big Data applications pose important challenges that algorithms must cope with, and they call for efficient architectural solutions. Data flows are in general characterized by high frequency and multiple streams, containing information that needs timely processing to generate value. For example, the use of sensors in smart cities has increased knowledge about urban areas, allowing traffic to be managed more efficiently, energy costs to be reduced and citizen experiences to be improved. With this data explosion, however, implementing processing systems that extract meaning from data in a timely manner is a challenge. Although Big Data systems inherently exploit high parallelism, conventional architectural solutions may not be fast enough to manage and cater to these requirements. Specialized solutions are needed that use advanced software and/or hardware to process data in real time at minimum cost and implementation effort. An example of a design pattern used in industry is the lambda architecture, defined as "a data-processing design pattern to handle massive quantities of data and integrate batch and real-time processing within a single framework" [1]. A lambda architecture compliant design, together with a well targeted deployment and a proper configuration of the parameters, is argued to enable an efficient implementation for real-time demands. To support design choices and adaptation to the dynamics of the workload, in this paper, which builds on [2], we propose a modeling framework for the performance evaluation of systems running applications based on the lambda architecture. To provide a user-oriented approach, the framework builds on SIMTHESys [3] to provide a domain-oriented model specification language. The underlying approach is based on multiformalism modeling and multisolution [4], by means of a domain specific language. By translation, the different submodels are solved analytically by an iterative algorithm. The approach is demonstrated against a case study.

The paper is organized as follows: Section 2 introduces lambda architectures as implementations of the lambda pattern and Section 3 briefly presents related work. Section 4 presents the modeling approach adopted, based on SIMTHESys, and Section 5 describes the solution technique. Finally, Section 6 summarizes conclusions and presents future extensions of this work.
2. Lambda architecture design pattern

Cloud computing has introduced multiple opportunities, such as seemingly infinite resource access and managed resources for complex tasks. Cloud-based architectures have lowered the access barriers to large infrastructures by providing virtualization of remote resources at an affordable cost. Being a service-based computing approach (either as Software as a Service (SaaS), Platform as a Service (PaaS) or Infrastructure as a Service (IaaS)), significant effort goes into designing systems to obtain suitable services. Efficient data management on public clouds has allowed on-demand data and computing in a cost effective manner. Multiple research efforts have focused on data management, cloud computing for data migration, real-time analytics and optimized virtual storage resource management, based on user needs and usage predictions. Companies such as Google and Amazon provide cloud platforms and services to interact with sensors and the Internet of Things, collecting and processing data; at high volumes, however, processing it in a timely manner remains challenging. There are additional performance issues, such as cloud QoS, service level agreements (SLA) and required coding skills. These are reported worldwide within academia and industry, with little progress in performance and QoS [5].

High availability systems adopt solutions to ensure availability levels such as 99.9%, but modeling their behavior presents an analytical challenge. This includes modeling performance metrics such as utilization, response time, throughput and much more. Performance modeling allows building trust and confidence in how systems perform. In clouds, performance targets can be defined in SLAs, where breaches lead to penalties and business loss for providers. Researchers have modeled performance with measurement, analytical or simulation methods focusing on QoS parameters at the IaaS layer [6], analyzing multiple heterogeneous servers to determine where incoming requests for VMs should be placed. Ismail et al. [7] used OpenStack to benchmark MapReduce jobs in an IaaS cloud Hadoop environment. Mei et al. [8] focused on performance impact by analyzing the effects of co-locating applications in a virtualized cloud; the QoS parameters analyzed were throughput, resource sharing effectiveness and the number of idle instances on the same host, and their experiments showed a 40% performance gain in the user experience.

Data analytics are essential for decision support systems, collecting and processing real data to help find patterns in complex data streams.
Amazon Web Services (AWS) and Google Cloud provide collections of resources (cloud hosted databases, map-reduce based engines such as Hadoop, Hive or Spark, and virtual machines to host, compute and manage data) as pay-as-you-go services. Yahoo's Pig project [9], Microsoft's SCOPE project [10] and Google initiatives [11] integrate declarative query constructs from the database community into map-reduce like software to allow greater data independence, code reusability and automatic query optimization. Mian et al. [12] highlighted how initial platform configurations can help optimize the costs of virtual machine provisioning for executing dynamic data workloads under SLA constraints. Tailored solutions for online and batch data processing can guarantee that constraints on non functional attributes, such as cost and network complexity, are satisfied. Solutions such as Spark SQL have helped obtain faster processing by compensating for the weaknesses of the Hadoop processing model [13]. Khazaei et al. [5] used performance metrics such as task blocking and delays to give a good indication of how well a cloud is performing. Their work showed the need for an efficient admission controller that can examine the time every job will take and perform an intelligent trade-off analysis of which jobs to accept. This work has been extended with queuing models to improve performance during live migration of virtual machines, reducing task rejection probabilities without compromising task service time. Optimizing the performances of applications can help improve overall system performances and reduce the costs of running services.

Focusing on data analytics applications, the lambda architecture supports real-time and batch processing within a single framework. This pattern, used by Twitter, is suited for applications in which data is collected regularly and computing is either performed as soon as the data arrives or delayed by saving the data to be processed in large batches, creating two coexisting data processing streams:

• real time processing, and
• batch processing for large data sources.

Please note that Amazon AWS also offers a Lambda service, which allows data processing via a serverless model. This is very different from the lambda architecture design pattern discussed in this paper. Here the definition of lambda architecture focuses on building a data processing architecture which allows fast and batch data processing for large datasets.
Figure 1: Basic lambda architecture for real time and batch processing.
Figure 1 shows the basic organization of the lambda architecture, with batch processing for large data sets and real time computing for data streams. This architecture allows users to optimize data processing costs by understanding which data needs real time processing and which can be processed in batches. Kiran et al. [2] used Amazon AWS services to implement a lambda architecture based solution, showing a cost reduction of 75% in data processing for network router data. Figure 2 shows the overall pattern of how data moves through the system. The real time stream is useful to perform short duration computations, or urgent but approximate evaluations that are feasible in a short time, or to store data into buckets with an efficient storage logic that supports later batch operations. In this paper we focus on the aspects of lambda architectures that are relevant to predict the performances of applications characterized by the need for both a real time and a batch processing activity, without going into a detailed analysis of the practical aspects of suitable applications.
Figure 2: Overall lambda architecture interacting with services.
3. Related work

Performance evaluation of large computing facilities is extremely complex: at scale, the interrelations between components and the abstraction layers produce a large number of mutual influences, and the number of parameters explodes. In these conditions, simulation approaches need long times to produce significant and accurate results, and the design complexity may become unmanageable. Classical analytical approaches also generally suffer from state space explosion and may not be able to scale up to meaningful subsystems, requiring advanced solutions (e.g. [15]). A survey of analytical and simulation based performance evaluation and monitoring approaches for high scale computing infrastructures is presented in [16]; we also point readers to [17] and [18]. In [19] and [15] a modeling approach is presented for Big Data map-reduce based applications, also founded on the design of domain specific languages. This paper proposes a similar approach, but with a different technique: here, the approach is based on the SIMTHESys framework [3], which allows the design and definition of multiformalism models with custom heterogeneous formalisms and the automatic generation of suitable solvers. The general solution approach used in this paper is inspired by [3].
Literature offers a few studies related to lambda architectures. In [2] a lambda architecture is implemented over the Amazon EC2 infrastructure to develop an efficient and cost effective Big Data oriented solution. In [20] a lambda architecture is used as the foundation for a large scale earth observation information mining application. In [21] Apache Storm is used to implement real time social network data stream processing.

4. Modeling approach

The performances of a system complying with the lambda architecture are influenced by several factors. The number of data streams that need to be processed and the frequency at which updates arrive determine the workload of the speed component of the architecture. The workload of the batch component is determined by the number of batch processes that are executed and by the frequency at which they occur. The workload of the serving component is defined by the number of queries to be executed and the frequency at which they are requested. The performance of each component is closely related to the complexity of the application it executes: in particular, applications based on frameworks such as Apache Spark or Storm are greatly influenced by the number of stages they are composed of, and by the relations among stages that can cause blocking, or waiting for other components to finish their tasks before continuing with the execution of the job. Moreover, all the frameworks proposed to implement such applications work in parallel environments, in which the total processing speed can be proportional to the number of compute nodes provisioned in the system. The storage also plays an important role in determining the performances of the system, since it is used as a connection link between several components of the lambda architecture: between the stream and the batch processing, and between both of them and the serving component. As for computation, the storage technologies proposed for the lambda architecture rely on both replication and distribution of the data on several nodes to increase performance, reliability, availability and capacity. The performances of the storage sub-system are then determined by the number of available nodes and by the replication parameters used to configure the adopted storage technologies.
(Figure 3 legend: node types Data Stream, Batch Process, User, Storage Cluster, Compute Cluster and Stage; arc types Trigger, Read/Write and Runs on.)
Figure 3: Modeling primitives.
4.1. Modeling primitives

In order to model the performance of lambda architecture systems, the user must be able to characterize them in terms of all influencing factors. Although this can be done using standard performance evaluation formalisms for which specific tools exist (for example Queuing Networks and JMT [22], Generalized Stochastic Petri Nets and GreatSPN [23], Performance Evaluation Process Algebra and the PEPA Workbench [24]), this would require very complex models, manageable only by experts in the field. To bridge this gap, and allow lambda architecture experts to quickly assess the performances of their projects, we propose a domain specific language tailored to the description of the considered type of systems. In particular, we propose a graph-based language whose nodes and arcs are shown in Figure 3.

Workloads for the speed, batch and serving components are defined respectively by the Data Stream, Batch Process and User nodes. A node of the first type characterizes one of the streams the system has to process: if the application has to combine data coming from several different sources, the model will have one Data Stream element for each. The element is characterized by a property λ, which defines the interarrival time distribution between samples in the stream. The Batch Process node is characterized by the number of concurrent batch jobs N and their think time distribution Z. This reflects one of the common ways in which batch elaboration is usually implemented: there are N processes that run continuously, and whenever a process completes its tasks, it waits for a given amount of time Z before restarting, to reduce the load on the system. The User nodes define different types of queries that might be submitted to the serving component of the system: each one is characterized by its own interarrival time distribution λ that specifies the interval between two successive requests.

The hardware specification of the architecture is given with the Storage Cluster and Compute Cluster nodes. The former are characterized by the number of nodes in the cluster N, the replication factor r, the consistency quorum a and the service time distribution S. Parameters N, r and a model the main features that influence the behavior of the type of distributed storage used in the lambda architecture. In particular, we consider that data is equally shared among N storage nodes, and that each request is mirrored r times. During a read operation, however, only a out of the r copies are required to be read to retrieve data, leaving the other r − a as redundancy to increase the availability of the system. Distribution S defines the service time of each disk access on a single node (either read or write). Compute Cluster primitives are characterized by the number of nodes in the cluster N and by a speedup factor α. This makes it possible to model clusters with hardware belonging to different generations and working at different speeds: a job taking $D_1$ time units on a node with speedup $\alpha_1$ will run in $D_2 = D_1 \alpha_1 / \alpha_2$ time units on a node with speedup $\alpha_2$.

The software architecture is modeled by nodes of type Stage, characterized by parameters N and S. In particular, each node represents a computation stage that can be executed with N parallel tasks, each one with duration S (when executed on a compute node with speedup α = 1). Table 1 summarizes the nodes available in the formalism to model lambda architecture scenarios, together with their corresponding parameters.

The relations between the node elements of models are defined using three possible arc types. Trigger arcs, represented by solid arrows, define relations that correspond to the triggering of tasks running on compute nodes. A new execution of a job stage can be triggered by the arrival of new data (Trigger arcs connecting Data Stream to Stage nodes), by the start of a new batch process (connections from Batch Process to Stage nodes), by a query from an external user (connections from User to Stage nodes), or by the end of a previous stage (connections from Stage to Stage nodes). Data Stream and Stage nodes can use storage devices to read or write their data. This is modeled with Read/Write arcs, represented with dashed arrows: if the arc is directed toward a Storage Cluster node, it represents a write operation; if the arc originates from the storage, it corresponds to a read.
Element          Param.  Description
Data Stream      λ       Inter-arrival time distribution of the data
Batch Process    N       Number of concurrent batch jobs
                 Z       "Think time" distribution of each job
User             λ       Inter-arrival time distribution of the queries
Storage cluster  N       Number of disks in the storage cluster
                 r       Replication factor of the data
                 a       Consistency quorum
                 S       Disks service time distribution
Compute cluster  N       Number of compute nodes in the cluster
                 α       Nodes speedup factor
Stage            N       Number of tasks required by the stage
                 S       Task service time distribution

Table 1: Graph nodes modeling primitives and their parameters.
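To make the formalism concrete, the primitives of Table 1 could be encoded as plain data structures. The following minimal Python sketch is illustrative only: class and field names are ours and do not reflect the SIMTHESys implementation.

```python
from dataclasses import dataclass

@dataclass
class DataStream:
    lam: float           # sample arrival rate (1 / mean inter-arrival time)

@dataclass
class BatchProcess:
    n_jobs: int          # N, concurrent batch jobs
    think_time: float    # Z, mean "think time" between runs

@dataclass
class User:
    lam: float           # query arrival rate

@dataclass
class StorageCluster:
    n_disks: int         # N
    replication: int     # r
    quorum: int          # a
    service_time: float  # E[S], mean disk service time

@dataclass
class ComputeCluster:
    n_nodes: int         # N
    speedup: float       # alpha

@dataclass
class Stage:
    n_tasks: int         # N, parallel tasks
    task_time: float     # E[S], mean task duration at speedup 1

# Arcs can be kept as (source, target, kind) triples,
# with kind in {"trigger", "read_write", "runs_on"}.
```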
Application stages are executed on compute nodes: the association between the software and the hardware (which could also be virtualized) is modeled by arcs of type Runs on, represented by dotted arrows, that connect Stage nodes to Compute Cluster nodes.

4.2. Assumptions

To produce a model that can be managed to assess performance indices, a set of simplifying assumptions is required. In this paragraph we summarize such assumptions, discuss their impact and present, when possible, workarounds that the modeling language allows to circumvent their limitations. In particular, we assume node homogeneity within a compute cluster: all the machines are identical and can access the storage resources in the same way. Only the speedup factor is allowed to scale the execution time across different clusters, but all the machines inside a cluster must be characterized by the same performance properties. This assumption is not limiting, since in most cases resources will be acquired from a datacenter whose provider aims to keep them as uniform as possible, to increase maintainability and manageability. We suppose, in each stream, data homogeneity between samples: they all represent data of the same type and of the same complexity (i.e. we avoid having in the same stream, for example, both movies and static images as possible input). This is however not a real limitation, since several different streams
can be introduced, each one representing data of a different complexity (i.e. we could have one static image stream and one movie stream). We also suppose homogeneous query complexity: all queries from a user must require a similar effort from the system to be answered. Again, this is not a limiting factor, since we can include several different User primitives in a model. We suppose storage access homogeneity: the time required to perform a storage operation does not depend on the type of the data, but just on the speed of the disks, on their number, on the replication and on the consistency quorum. This assumption is in line with the type of storage systems used, where data is divided into big chunks (i.e. 128 MB) that are equally spread among the available disks. We assume that the application complexity is entirely defined by the stages that compose it, and that the relations among the stages can be defined by a DAG (Directed Acyclic Graph). We suppose that there is a scheduler that assigns the tasks of a stage in parallel to the machines currently available in the cluster, in a round-robin fashion. In particular, in each stage, all tasks are considered to be identical. As soon as all the tasks of one stage are finished, the stages connected to it by output arcs can start; however, a stage can start only when all the stages connected to it by input arcs are finished. Batch jobs are considered to be concluded when a Trigger arc returns the execution to the Batch Process node that originated them. These assumptions are in line with the way in which engines like Apache Spark or Storm work. Finally, we consider the system working in its optimal conditions: all the nodes are up and running, and no fault-tolerance technique that could improve the reliability of the architecture at the expense of performance is applied. For workflows characterized by a limited number of nodes, this will be the case. When the number of nodes in the cluster grows large (i.e. thousands of nodes), this assumption might no longer be valid: in this case a suitable workaround is to include in the service time distribution some tail behavior to account for the degraded performance caused by recovery procedures.

4.3. Examples

To better show the primitives of the proposed modeling language, we focus on a simple example of an application conforming to the lambda architecture. In particular, we focus on data coming from a single stream: the speed computation is performed by a simple process executed on each new reading.
Figure 4: A single stream processed by nodes in a single cluster.
The batch process executes a map-reduce job to analyze the data, and user queries can be executed with a single job. Figure 4 shows a simple deployment of such an application, in which all computing tasks run on the same cluster (node Compute Cluster), and a single storage pool is used (node Storage Cluster). The single stream of data is represented by node Data Stream: the triggering of the speed part is modeled by the Trigger arc that connects it to the Speed stage, and the recording of the log to be processed in batch mode is represented by the Read/Write arc that connects it to the Storage Cluster node. The batch map-reduce process is modeled by the loop of Trigger arcs that connects nodes Batch Process, Map and Reduce. The access of the batch process to the recordings of the stream is represented by the Read/Write arc that connects Storage Cluster to Map. The use of the storage cluster to hold both the real time data and the results of the batch process is modeled by the Read/Write arcs that connect nodes Reduce and Speed as input, and Serving as output, to Storage Cluster. The serving process is modeled by the Query User node, connected to the Serving stage. The deployment of the four stages on a single cluster is modeled by the four Runs on arcs that connect the four stages, Map, Reduce, Speed and Serving, to the node Compute Cluster.

In Figure 5 the same application scenario is modeled in a deployment in which three compute clusters, represented by nodes Speed Cluster, Batch
Figure 5: A single stream processed by nodes in multiple clusters per layer.
Cluster and Serving Cluster, are used to put each layer in a separate environment. In this case the four Runs on arcs connect respectively Map and Reduce to Batch Cluster, Speed to Speed Cluster and Serving to Serving Cluster. This deployment also uses two storage pools: one for holding the logged measures, Batch Storage, and one to hold the results of both the batch and the speed layer for the serving component, Serving Storage. To account for this, in the model Read/Write arcs go from Data Stream to Batch Storage, from Batch Storage to Map, from Reduce and from Speed to Serving Storage, and from Serving Storage to Serving.

4.4. Performance indices

We are interested in computing several performance indices on the proposed model. For what concerns the speed component, we are mainly interested in determining whether the system will be able to serve incoming streams in the current configuration. In case of stability, we can determine the end-to-end delay of one sample: the waiting time required before data processed by the speed component can be queried in the serving part of the system. For the batch component, we are mainly interested in the batch throughput: the number of times the considered batch computation will be repeated in a given time frame. Finally, for the serving layer we are interested in assessing its stability (whether it is sufficiently performant to serve all incoming requests without saturating) and the expected response time
(the time required for each query to return its answer to the user). For the hardware components, that is, Storage Cluster and Compute Cluster nodes, we are interested in computing the average utilizations, the average number of tasks in their queues, the throughput and the average response times.

5. Results

Models created according to the formalism presented in Section 4 are solved using a multisolution approach, that is, "the possibility of applying, selectively or in parallel, different solving engines to the same model, e.g. to optimize a solution process accordingly to some characteristics of the specific model" [4], as a set of interdependent queuing networks. In particular, each subsystem identified by specific primitives is converted into either a queuing or a deterministic timing model for which a quick analytical solution exists. The whole model is thus converted into a set of submodels, each one characterized by three sets of properties: the input parameters, the interaction parameters and the output indices. Input parameters correspond to the properties assigned in the model to the elements of the formalism: they define the timing characteristics of the system under study (i.e. the average service times, the arrival rates of new requests and so on). Interaction parameters are instead measures that characterize the interactions with the other components of the model: for example, the expected read/write delay of a storage component, used to define the total service time of a request in a computation node. Output indices are the performance indices associated with the system model primitives (i.e. the ones defined in Section 4.4), or additional measures required by the multisolution process. At the beginning of the solution, interaction parameters are set to initial guesses. Each submodel is solved independently from the others, and then their output indices are combined to define a new set of interaction parameters for all the subsystems. Models for the storage, speed and serving components might become unstable: if this occurs, an exception is raised and the iteration process stops, assessing the system as unstable. Otherwise, the process is repeated in a fix-point iteration until either all the interaction parameters have had a relative change smaller than a threshold, or a
maximum number of iterations is reached. The process is depicted in the following algorithm:

1: initialise
2: repeat
3:   solve storage model
4:   if model unstable then
5:     exit: system unstable
6:   end if
7:   solve speed and serving models
8:   if models unstable then
9:     exit: system unstable
10:  end if
11:  solve batch model
12:  if convergence reached then
13:    exit: solution found
14:  end if
15:  update interaction parameters
16: until maximum iterations reached
17: exit: solution failed

5.1. Submodels for the Storage Cluster components

Storage Clusters are modeled with M/M/1/FIFO queues, in which the average service rate $\mu = 1/E[S]$ is derived from parameter S of the corresponding node, and the total arrival rate λ is computed from the rates at which the disks are accessed and from the other parameters N, r and a. In particular, let us call $\lambda_R$ and $\lambda_W$ the arrival rates for reads and writes to the storage. In our solution technique $\lambda_R$ and $\lambda_W$ are two interaction parameters of the submodels representing the storage subsystems. We assume that requests are equally shared among the N disks in the cluster. Read requests need to access only a disks, while write requests need to access r disks, to account for both the replication and the consistency check factors. For this reason, the arrival rate to the disks is determined as $\lambda = (r \lambda_W + a \lambda_R)/N$. The FIFO assumption derives from the fact that requests to the storage are usually run in sequence by the controller. The exponential assumptions for both the arrival and service times are used to reflect the variability in the input stream and in the media access process. The average response time is then obtained using
the conventional M/M/1 formula:

$$R_{\mathrm{Storage}} = \frac{1}{\mu - \lambda} \qquad (1)$$
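As a minimal sketch, the storage submodel can be evaluated as follows; the function name and signature are ours, and the per-disk arrival rate follows the expression $\lambda = (r \lambda_W + a \lambda_R)/N$ derived above.

```python
def storage_response_time(lam_r, lam_w, n_disks, r, a, mean_s):
    """M/M/1 storage submodel, Eq. (1).
    lam_r, lam_w: read/write arrival rates (interaction parameters);
    n_disks, r, a: disks, replication factor, consistency quorum;
    mean_s: mean disk service time E[S]."""
    mu = 1.0 / mean_s                          # per-disk service rate
    lam = (r * lam_w + a * lam_r) / n_disks    # per-disk arrival rate
    if lam >= mu:
        raise RuntimeError("storage submodel unstable")
    return 1.0 / (mu - lam)                    # R_Storage = 1/(mu - lambda)
```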
5.2. Submodels for the User and Data Stream components

User and Data Stream components are modeled as M/G/c/PS queues. The arrival rate λ corresponds either to the number of queries executed per second by the users (User nodes) or to the number of samples read per second (Data Stream nodes). The service time corresponds to $1/\mu = T_{\mathrm{Stages}}$, the time required to execute the connected stages. Both $T_{\mathrm{Stages}}$ and the assigned number of cores c are interaction parameters of the submodels. From these parameters, the average number of used nodes U and the average response time R can be computed using standard queuing network formulas, since M/G/c/PS queues have the same performance indices as M/M/c queues:

$$U_{\mathrm{User/Stream}} = U = \frac{\lambda}{\mu} \qquad (2)$$

$$R_{\mathrm{User/Stream}} = \frac{1}{\lambda} \cdot \frac{\dfrac{U^{c+1}}{(c-1)!\,(c-U)^2}}{\displaystyle\sum_{n=0}^{c-1} \frac{U^n}{n!} + \frac{U^c}{(c-1)!\,(c-U)}} + \frac{1}{\mu} \qquad (3)$$
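A possible implementation of Eqs. (2)-(3), using the standard Erlang-C form of the M/M/c response time (function and parameter names are ours):

```python
from math import factorial

def mmc_indices(lam, t_stages, c):
    """M/M/c submodel for User and Data Stream nodes, Eqs. (2)-(3).
    lam: arrival rate; t_stages: mean service time 1/mu; c: assigned cores."""
    mu = 1.0 / t_stages
    U = lam / mu                               # Eq. (2): mean busy servers
    if U >= c:
        raise RuntimeError("submodel unstable")
    # Denominator shared by the Erlang-C probability of waiting
    denom = sum(U**n / factorial(n) for n in range(c)) \
          + U**c / (factorial(c - 1) * (c - U))
    # Eq. (3): mean waiting time plus mean service time
    R = U**(c + 1) / (factorial(c - 1) * (c - U)**2) / denom / lam + 1.0 / mu
    return U, R
```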
5.3. Submodels for the Batch Process components

Batch Process components are studied by means of an embedded Markov chain whose states account for the number of jobs in service in the computation infrastructure when a new job starts its execution. This model corresponds to an M/D/1/K/K system, where the total population K = N also corresponds to the total capacity of the queue, the deterministic service time D corresponds to the time required to execute the stages, and the state-dependent Poisson arrival parameter $\lambda = 1/Z$ accounts for batch processes that end their "think time" phase and start being elaborated by the system. The deterministic assumption can be reasonable for the batch component, since one of the goals of map-reduce jobs is trying to achieve a constant execution time, reducing the variance of the response times. The embedded Markov chain is characterized by an N × N transition matrix C defined as follows:

$$C = \begin{pmatrix}
0 & 1 & 0 & \cdots & 0 \\
P(0,N-1) & P(1,N-1) & P(2,N-1) & \cdots & P(N-1,N-1) \\
0 & P(0,N-2) & P(1,N-2) & \cdots & P(N-2,N-2) \\
\vdots & & \ddots & & \vdots \\
0 & \cdots & 0 & P(0,1) & P(1,1)
\end{pmatrix} \qquad (4)$$

in which P(j, n) represents the probability of having j arrivals out of n possible jobs with think time $Z = 1/\lambda$ and deterministic duration D; thanks to the Poisson assumption, it is defined as:

$$P(j,n) = \binom{n}{j} \left(1 - e^{-\lambda D}\right)^j \left(e^{-\lambda D}\right)^{n-j} \qquad (5)$$

The steady state solution vector $\pi = |\pi_1, \ldots, \pi_N|$ of the embedded Markov chain is computed in the usual way:

$$\pi C = \pi, \qquad \sum_{i=1}^{N} \pi_i = 1 \qquad (6)$$

and is then normalized to account for the time spent in the embedded states:

$$p_i = \begin{cases}
\dfrac{\pi_1 Z}{\pi_1 Z + (1-\pi_1) D} & i = 1 \\[2ex]
\dfrac{\pi_i D}{\pi_1 Z + (1-\pi_1) D} & i > 1
\end{cases} \qquad (7)$$

The utilization of the submodel is then computed as $U_{\mathrm{Batch}} = 1 - p_1$, from which the throughput follows by the Utilization Law, $X_{\mathrm{Batch}} = U_{\mathrm{Batch}}/D$. The average response time $R_{\mathrm{Batch}} = N/X_{\mathrm{Batch}} - Z$ is computed using the Response Time Law [25].
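The following sketch shows how the batch submodel could be solved numerically; the state encoding (state i holds i jobs in the system) is our reading of the transition matrix above, and the function name is illustrative.

```python
import numpy as np
from math import comb

def batch_indices(N, Z, D):
    """M/D/1/K/K embedded-chain submodel of a Batch Process, Eqs. (4)-(7).
    N: concurrent batch jobs; Z: mean think time; D: deterministic stage time."""
    lam = 1.0 / Z
    q = 1.0 - np.exp(-lam * D)        # prob. a thinking job arrives within D

    def P(j, n):                      # Eq. (5): binomial arrival probabilities
        return comb(n, j) * q**j * (1.0 - q)**(n - j)

    C = np.zeros((N, N))              # Eq. (4); states 0..N-1 = jobs in system
    C[0, 1] = 1.0                     # empty system: wait for the next arrival
    for i in range(1, N):             # i jobs in the system, N - i thinking
        for j in range(N - i + 1):
            C[i, i - 1 + j] = P(j, N - i)

    w, v = np.linalg.eig(C.T)         # Eq. (6): left eigenvector pi C = pi
    pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    pi = pi / pi.sum()

    norm = pi[0] * Z + (1.0 - pi[0]) * D              # Eq. (7)
    p = np.where(np.arange(N) == 0, pi * Z, pi * D) / norm

    U = 1.0 - p[0]                    # utilization
    X = U / D                         # Utilization Law
    R = N / X - Z                     # Response Time Law
    return U, X, R
```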
5.4. Submodels for the Stage components

The time required to compute a workflow specified by an interconnected set of Stage nodes is approximated as the sum of the completion times of the involved stages. The time required by a stage, $T_{\mathrm{Stage}}$, is again approximated with a deterministic assumption. Let us assume that a stage with N tasks, each one of duration S, is run on a compute cluster with speedup α, using c of its cores. The number of cores c assigned to a specific stage is an interaction parameter of the submodel characterizing the stage. Let us also assume that if a stage accesses data on a storage cluster, it requires a time specified by an interaction parameter $R_{\mathrm{Storage}}$; if a stage does not access any storage cluster, we set $R_{\mathrm{Storage}} = 0$. The execution time is computed as:

$$T_{\mathrm{Stage}} = \frac{E[S]}{\alpha} \left\lceil \frac{N}{c} \right\rceil + R_{\mathrm{Storage}} \qquad (8)$$

where ⌈x⌉ is the smallest integer greater than or equal to x. Since we consider the system to be stable, the throughput of each stage $X_{\mathrm{Stage}}$ is equal to the arrival rate λ of the input process if the stage is used by a User or Data Stream component, or to the throughput $X_{\mathrm{Batch}}$ of the batch process if it is used in a batch:

$$X_{\mathrm{Stage}} = \begin{cases} \lambda & \text{for User or Data Stream} \\ X_{\mathrm{Batch}} & \text{for Batch} \end{cases} \qquad (9)$$
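Eqs. (8)-(9) translate directly into code; a hypothetical sketch (parameter names are ours):

```python
from math import ceil

def stage_time(n_tasks, mean_s, alpha, cores, r_storage=0.0):
    """Eq. (8): deterministic completion time of one stage."""
    return (mean_s / alpha) * ceil(n_tasks / cores) + r_storage

def workflow_time(stages):
    """Workflow time as the sum of the stage completion times (Section 5.4)."""
    return sum(stage_time(*stage) for stage in stages)

# Example with the Map stage of Table 2: 144 one-minute tasks
# on 20 cores with speedup 1 and a 6 s storage delay.
t_map = stage_time(144, 60.0, 1.0, 20, r_storage=6.0)   # seconds
```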
5.5. Interaction parameters setup

Initially, for all the compute nodes shared by several services (i.e., batch, speed and serving applications), the available nodes are equally subdivided among them: this determines the interaction parameter c for the submodels corresponding to User, Data Stream and Stage nodes. The throughput of all batch submodels is also initially set to zero, $X_{\mathrm{Batch}} = 0$. Then, at each iteration, storage nodes are considered first, by computing their read and write loads $\lambda_R$ and $\lambda_W$. In particular we have:

$$\lambda_R = \sum_{\text{user/stream } i \text{ reading}} \lambda_i + \sum_{\text{stages } i \text{ reading}} X_i N_i \qquad (10)$$

$$\lambda_W = \sum_{\text{user/stream } i \text{ writing}} \lambda_i + \sum_{\text{stages } i \text{ writing}} X_i N_i \qquad (11)$$

Using the resulting performances of the disks, Stage nodes are evaluated. Then User, Data Stream and Stage submodels are analyzed. In particular, the throughputs $X_{\mathrm{Batch}}$ will be used in the next iteration to obtain a better estimation of the storage workloads. At the end of each iteration, for all compute nodes shared by several services, a new set of sharing assignments, proportional to the utilizations obtained in the previous iteration, is computed.
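Putting the pieces together, the fix-point loop of Section 5 could be organized as below. This is a sketch under the stated assumptions: the model object and all method names are illustrative, not part of an existing API.

```python
def solve(model, max_iter=100, tol=1e-4):
    """Fix-point multisolution loop over the interdependent submodels."""
    model.init_interaction_parameters()   # equal core sharing, X_Batch = 0
    prev = model.interaction_parameters()
    for _ in range(max_iter):
        model.solve_storage()             # Eqs. (10)-(11), then Eq. (1)
        model.solve_speed_and_serving()   # Eqs. (2)-(3); may raise if unstable
        model.solve_batch()               # Eqs. (4)-(7)
        model.update_sharing()            # re-split shared compute nodes
        new = model.interaction_parameters()
        if max(abs(a - b) / max(abs(b), 1e-12)
               for a, b in zip(new, prev)) < tol:
            return model.output_indices() # convergence reached
        prev = new
    raise RuntimeError("no convergence: fall back to simulation")
```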
Figure 6: Varying the arrival rate of the stream, with a single compute cluster.
5.6. Notes about the selected queuing models and the corresponding service time distributions

The queuing models used for the various system components might seem arbitrary, and the impossibility of using general distributions in all cases might seem a limitation. However, the proposed submodels are general enough to cover several realistic applications. M/M/1 queues have been extensively used in studying RAID disks, and the FIFO queue assumption is in line with the way in which distributed storage systems serve concurrent requests. Processor sharing is a reasonable approximation of the scheduling policies used by most current operating systems. The deterministic durations assumed for the batch elaboration and for the stages in which such a process is divided have already been discussed in the relevant sections.

5.7. Numerical results

We have analyzed the models given in Figures 4 and 5 with the parameters summarized in Table 2. Values have been chosen not to match a real application, but to emphasize the results that can be obtained by exploiting the proposed modeling technique. Figure 6 shows the average response times for the storage, the speed, the batch and the serving components of the single cluster model shown in Figure 4 for different arrival rates λ of the stream. With the given configuration, the system is able to handle up to about λ = 12.5 samples per minute: for this reason we have varied λ in the range [0.5, 12].
Common parameters
Element        Par.  Val.
Data Stream    λ     1 samp/m.
Batch Process  N     10
               Z     120 m.
Query User     λ     0.2 req/m.
Map            N     144
               S     1 m.
Reduce         N     192
               S     1.5 m.
Speed          N     1
               S     30 s.
Serving        N     1
               S     6 s.

Model specific parameters
Element          Par.  Val.
Storage Cluster  N     10
                 r     3
                 a     2
                 S     6 s.
Compute Cluster  N     20
                 α     1
Batch Storage    N     4
                 r     3
                 a     2
                 S     6 s.
Serving Storage  N     6
                 r     3
                 a     2
                 S     4.8 s.
Batch Cluster    N     6
                 α     1
Speed Cluster    N     10
                 α     1.2
Serving Cluster  N     4
                 α     1

Table 2: Model parameters.
Figure 7: Varying the number of nodes of a single compute cluster.
Note that two different scales for the y-axis are used, since requests for the storage, speed and serving submodels have response times in the range of seconds, while batch jobs take hours to be completed. As expected, all performance indices tend to worsen as the load increases; the batch component however presents a stepped behavior, while the other components have a more continuous evolution. This is caused by the way in which nodes are assigned to the applications: the speed and serving components are served first, and the batch processes run on the remaining nodes. Whenever the load of the speed and serving components becomes so large as to require an extra node to handle it, there is a jump in the response time of the batch component, since it now has to run on a more limited set of resources.

The same scenario, with a fixed input stream arrival rate of λ = 12 samples per minute and a variable number of nodes N in the range 20 − 110, is shown in Figure 7. As can be seen, we have the expected reduction in the time required to execute batch jobs as the number of available nodes increases. Again, the performances of the batch component present a stepped behavior due to the way in which nodes are shared among the services. However, we see an unexpected decrease in performance for the storage, the speed and the serving components. This occurs because a reduction of the execution time for the batch component implies an increase of the corresponding throughput $X_{\mathrm{Batch}}$, which in turn causes an increase of the workload of the storage component. As a chain effect, the increase of the storage response time $R_{\mathrm{Storage}}$ causes a decrease of the performances of the speed and the
Figure 8: Varying the arrival rate of the stream, with multiple compute clusters.
serving components. To avoid this negative influence, the system administrator should increase the think time Z of the batch processes, to reduce their throughput and avoid penalizing the other layers.

Figures 8 and 9 show the same type of analysis for the multiple cluster configuration of Figure 5. In order to make a fair comparison, resources have been divided to match the totals of the single cluster configuration. In particular, the N = 10 disks of the single storage have been partitioned by assigning N_Batch = 4 storage nodes to the batch component and N_Serving = 6 to the serving component. In the same way, the N = 20 compute nodes have been divided into N_Batch = 6, N_Speed = 10 and N_Serving = 4 nodes. In the study that varies the total number of nodes, new resources have been assigned to each cluster so as to maintain the same proportions as the initial configuration. However, to test the possibility of considering heterogeneous setups, the speedup of the Speed Cluster has been increased to α = 1.2, and the service time of the Serving Storage has been reduced to E[S] = 4.8 s. As we can see, the system presents more or less the same evolution, although in these conditions there are no jumps in the performances of the batch component. This occurs because the static assignment of compute nodes does not cause sudden changes in the resources assigned to the batch layer, allowing it to follow a smoother evolution. Comparing instead the results for the two types of systems, we can see that the fixed allocation case privileges the batch processing tasks, while the single cluster case emphasizes the performances
Figure 9: Varying the number of nodes of a multiple compute clusters.
of both the speed and serving layers. Again, this is caused by the larger load on the storage that an improved batch service causes to the system.

5.8. Convergence and stability

If the system is unstable (i.e. it is not able to handle the rate at which either sensor reads or user queries arrive), the fix-point iteration will not converge and the iteration will stop as soon as one of the subsystems modeled as an M/M/1 or M/M/c queue saturates (i.e., utilization U ≥ 1 for the M/M/1 or U ≥ c for the M/M/c). Unfortunately, we cannot prove convergence for the cases in which the system is stable, due to the complexity of the analytical solution of the M/D/1/K/K queues used to model the batch component of the system. However, we know from the operational analysis laws that the throughput of all batch processes is limited by $X_{\mathrm{Batch}} \leq 1/D$. This asymptote bounds the arrival rates $\lambda_R$ and $\lambda_W$ of the storage clusters, which in turn bound their average response time $R_{\mathrm{Storage}}$. If the system remains stable with the maximum average response time for the storage, then the algorithm cannot diverge. After the first iteration, response times will increase at each iteration: this should ensure convergence, at least when no compute cluster is shared among different processes (i.e. models such as the one shown in Figure 5). When there are shared clusters, we cannot prove that the fixed point algorithm will not end up in a loop of configurations where the available nodes are shared in different ways among the processes using the same compute cluster.
Nevertheless, we have extensively tested the approach and obtained (when the system is stable) convergence in a few (around 20) iterations. If the maximum number of iterations is reached, the proposed multisolution technique cannot be applied, and the user has to resort to more precise (but more time consuming) solutions, such as discrete event simulation of the model. We are currently developing a simulator of the models described by the language presented in Section 4 to address such cases (even if less efficiently).

6. Conclusions

In this paper we presented a modeling approach that is suitable for the performance evaluation of lambda architectures, to support the design and assessment decision process. The proposed solution provides a fast tool that works with synthetic characterizations, or hypotheses, about the workload, and converges in a wide region of the parameter space. To let the approach be usable by domain experts who are not familiar with analytical modeling, we also provided a domain specific language that hides the complexity of the evaluated model behind a more abstract and friendly model. Future work includes the development of a different solver, still under the umbrella of the SIMTHESys framework, to face by means of simulation the cases in which the iterative analytic solver does not converge.

7. Acknowledgement

This article is based upon work from COST Action IC1406 High-Performance Modelling and Simulation for Big Data Applications (cHiPSet), supported by COST (European Cooperation in Science and Technology).

References

[1] Amazon, Lambda architecture for batch and real-time processing on AWS with Spark Streaming and Spark SQL. URL http://img.bss.csdn.net/201508121544255257.pdf

[2] M. Kiran, P. Murphy, I. Monga, J. Dugan, S. S. Baveja, Lambda architecture for cost-effective batch and speed Big Data processing, in: Proceedings of the 2015 IEEE International Conference on Big Data (Big Data), BIG DATA '15, IEEE Computer Society, Washington, DC, USA, 2015, pp. 2785–2792.
[3] E. Barbierato, G.-L. D. Rossi, M. Gribaudo, M. Iacono, A. Marin, Exploiting product forms solution techniques in multiformalism modeling, Electronic Notes in Theoretical Computer Science 296 (2013) 61–77. doi:10.1016/j.entcs.2013.07.005.

[4] M. Gribaudo, M. Iacono, An introduction to multiformalism modeling, in: M. Gribaudo, M. Iacono (Eds.), Theory and Application of Multi-Formalism Modeling, IGI Global, Hershey, 2014, pp. 1–16.

[5] H. Khazaei, J. Misic, V. B. Misic, Performance analysis of cloud computing centers using M/G/m/m+r queuing systems, IEEE Transactions on Parallel and Distributed Systems 23 (5) (2012) 936–943. doi:10.1109/TPDS.2011.199.

[6] D. Gill, H. M. Pandey, Approaches for software performance modelling, cloud computing and OpenStack, International Journal of Computer Applications 119 (22) (2015) 31–35.

[7] M. A. Ismail, M. F. Ismail, H. Ahmed, OpenStack cloud performance optimization using Linux services, in: 2015 International Conference on Cloud Computing (ICCC), 2015, pp. 1–4. doi:10.1109/CLOUDCOMP.2015.7149648.

[8] Y. Mei, L. Liu, X. Pu, S. Sivathanu, Performance measurements and analysis of network I/O applications in virtualized cloud, in: 2010 IEEE 3rd International Conference on Cloud Computing, 2010, pp. 59–66. doi:10.1109/CLOUD.2010.74.

[9] C. Olston, B. Reed, U. Srivastava, R. Kumar, A. Tomkins, Pig Latin: A not-so-foreign language for data processing, in: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, ACM, New York, NY, USA, 2008, pp. 1099–1110. doi:10.1145/1376616.1376726.

[10] R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, S. Weaver, J. Zhou, SCOPE: Easy and efficient parallel processing of massive data sets, Proc. VLDB Endow. 1 (2) (2008) 1265–1276. doi:10.14778/1454159.1454166.
[11] J. Hoover, Start-ups bring Google's parallel processing to data warehousing, Information Week. URL http://www.informationweek.com/software/information-management/start-up

[12] R. Mian, P. Martin, J. L. Vazquez-Poletti, Provisioning data analytic workloads in a cloud, Future Generation Computer Systems 29 (6) (2013) 1452–1458. doi:10.1016/j.future.2012.01.008.

[13] R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, I. Stoica, Shark: SQL and rich analytics at scale, in: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD '13, ACM, New York, NY, USA, 2013, pp. 13–24. doi:10.1145/2463676.2465288.

[14] Y. Mei, L. Liu, X. Pu, S. Sivathanu, Performance measurements and analysis of network I/O applications in virtualized cloud, in: 2010 IEEE 3rd International Conference on Cloud Computing, 2010, pp. 59–66. doi:10.1109/CLOUD.2010.74.

[15] A. Castiglione, M. Gribaudo, M. Iacono, F. Palmieri, Exploiting mean field analysis to model performances of Big Data architectures, Future Generation Computer Systems 37 (2014) 203–211. doi:10.1016/j.future.2013.07.016.

[16] M. Gribaudo, M. Iacono, F. Palmieri, Performance modeling of Big Data oriented architectures, in: J. Kolodziej, F. Pop, B. Di Martino (Eds.), Resource Management for Big Data Platforms and Applications, Computer Communications and Networks, Springer International Publishing, 2016, pp. 3–34. doi:10.1007/978-3-319-44881-7.

[17] L. Xu, J. Cipar, E. Krevat, A. Tumanov, N. Gupta, M. A. Kozuch, G. R. Ganger, Agility and performance in elastic distributed storage, Trans. Storage 10 (4) (2014) 16:1–16:27. doi:10.1145/2668129.

[18] F. Yan, A. Riska, E. Smirni, Fast eventual consistency with performance guarantees for distributed storage, in: Distributed Computing Systems Workshops (ICDCSW), 2012 32nd International Conference on, 2012, pp. 23–28. doi:10.1109/ICDCSW.2012.21.
[19] E. Barbierato, M. Gribaudo, M. Iacono, Modeling Apache Hive based applications in Big Data architectures, in: Proceedings of the 7th International Conference on Performance Evaluation Methodologies and Tools, ValueTools '13, ICST, Brussels, Belgium, 2013, pp. 30–38. doi:10.4108/icst.valuetools.2013.254398.

[20] M. Quartulli, J. Lozano, I. G. Olaizola, Beyond the lambda architecture: Effective scheduling for large scale EO information mining and interactive thematic mapping, in: 2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2015, pp. 1492–1495. doi:10.1109/IGARSS.2015.7326062.

[21] A. Batyuk, V. Voityshyn, Apache Storm based on topology for real-time processing of streaming data from social networks, in: 2016 IEEE First International Conference on Data Stream Mining Processing (DSMP), 2016, pp. 345–349. doi:10.1109/DSMP.2016.7583573.

[22] M. Bertoli, G. Casale, G. Serazzi, JMT: performance engineering tools for system modeling, SIGMETRICS Performance Evaluation Review 36 (4) (2009) 10–15. doi:10.1145/1530873.1530877.

[23] S. Baarir, M. Beccuti, D. Cerotti, M. D. Pierro, S. Donatelli, G. Franceschinis, The GreatSPN tool: recent enhancements, SIGMETRICS Performance Evaluation Review 36 (4) (2009) 4–9. doi:10.1145/1530873.1530876.

[24] S. Gilmore, J. Hillston, The PEPA workbench: A tool to support a process algebra-based approach to performance modelling, in: Computer Performance Evaluation, Modeling Techniques and Tools, 7th International Conference, Vienna, Austria, May 3-6, 1994, Proceedings, 1994, pp. 353–368.

[25] E. D. Lazowska, J. Zahorjan, G. S. Graham, K. C. Sevcik, Quantitative System Performance: Computer System Analysis Using Queueing Network Models, Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1984.
Biographies
Marco Gribaudo is an Associate Professor at the Politecnico di Milano, Italy. He works in the performance evaluation group. His current research interests are multi-formalism modeling, queueing networks, fluid models, mean field analysis and spatial models. The main applications to which these methodologies are applied come from Big Data applications, Cloud Computing, Multi-Core Architectures and Wireless Sensor Networks.
Mauro Iacono is a tenured Assistant Professor and Senior Researcher (holding a qualification as Associate Professor) in Information Processing Systems at the Dipartimento di Matematica e Fisica, Università degli Studi della Campania "Luigi Vanvitelli" (previously known as Seconda Università degli Studi di Napoli). He holds a PhD degree in Electrical Engineering and an MSc degree in Computer Engineering. He has published more than 70 peer reviewed scientific papers in journals, books and conference proceedings, and has served as editor, chairman, committee member and reviewer for around 25 journals and more than 100 conferences. He is a member of IEEE and other scientific societies. His research activity is mainly centered on the field of performance modeling of complex computer-based systems, with special attention to multiformalism modeling techniques, critical systems, Cloud and Big Data systems, and cyber-physical systems. More information is available at http://www.mauroiacono.com.
Dr. Kiran works as a Research Scientist at LBNL, working on intent-based networking and engineering intelligent networks for optimizing performance and user experience. Her work focuses on learning and decentralized optimization of system architectures and algorithms for high performance computing, agent-based simulations, underlying networks and Cloud infrastructures. She has been exploring various platforms such as HPC grids, GPUs, Cloud and SDN-related technologies. She uses QoS optimization, parallelization algorithms and software engineering principles to solve complex data intensive problems such as large-scale complex simulations. Over the years, she has been working with biologists, economists and social scientists, building tools and optimizing architectures for multiple problems in their domains.