Future Generation Computer Systems
Fault-tolerant Service Level Agreement lifecycle management in clouds using actor system

Kuan Lu a,b,∗, Ramin Yahyapour a,b, Philipp Wieder a, Edwin Yaqub a,b, Monir Abdullah a,d, Bernd Schloer a, Constantinos Kotsokalis c

a Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen, Germany
b The University of Göttingen, Germany
c AfterSearch P.C., Greece
d Thamar University, Yemen
Highlights

• We elaborate a utility architecture that optimizes resource deployment.
• We provide a mechanism for optimized SLA negotiation.
• Using an actor system as the basis, the entire SLA management can be efficiently parallelized.
• We separate the agreement's fault-tolerance strategies into multiple autonomous layers.
• A realistic approach for the automated management of the complete SLA lifecycle.
Article info

Article history:
Received 30 September 2014
Received in revised form 6 March 2015
Accepted 19 March 2015
Available online xxxx

Keywords:
Service Level Agreements lifecycle
Cloud
Actor system
Abstract

Automated Service Level Agreements (SLAs) have been proposed for cloud services as contracts used to record the rights and obligations of service providers and their customers. Automation refers to the electronic, formalized representation of SLAs and the management of their lifecycle by autonomous agents. Automated SLA management has recently become increasingly important. In previous work, we have elaborated a utility architecture that optimizes resource deployment according to business policies, as well as a mechanism for optimization in SLA negotiation. We take all that a step further with the application of actor systems as an appropriate theoretical model for fine-grained, yet simplified and practical, monitoring of massive sets of SLAs. We show that this is a realistic approach for the automated management of the complete SLA lifecycle, including negotiation and provisioning, but focus on monitoring as the driver of contemporary scalability requirements. Our proposed work separates the agreement's fault-tolerance concerns and strategies into multiple autonomous layers that can be hierarchically combined into an intuitive, parallelized, effective and efficient management structure.

© 2015 Elsevier B.V. All rights reserved.
1. Introduction

Contemporary IT service management (ITSM) expects service levels to be managed alongside functional service properties. This requires that arrangements are in place with internal IT support
∗ Corresponding author at: Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen, Germany. Tel.: +49 551 3920427; fax: +49 551 2012150.
E-mail addresses: [email protected] (K. Lu), [email protected] (R. Yahyapour), [email protected] (P. Wieder), [email protected] (E. Yaqub), [email protected] (M. Abdullah), [email protected] (B. Schloer), [email protected] (C. Kotsokalis).
http://dx.doi.org/10.1016/j.future.2015.03.016
0167-739X/© 2015 Elsevier B.V. All rights reserved.
providers and external suppliers in the form of Operational Level Agreements (OLAs) and Underpinning Contracts (UCs), respectively [1]. More recently, automated Service Level Agreements (SLAs) have been proposed as a means to establish a common understanding of expectations and obligations of service providers and their customers. Specifically, automation refers to the electronic formalized representation of SLAs and the management of their lifecycle by autonomous agents. Such proposals have been linked particularly to cloud services. Previously, we have elaborated a utility architecture that optimizes resource deployment according to business policies and a mechanism for optimized SLA negotiation [2]. In this work, we take all that a step further with the application of actor systems
Fig. 1. Evolution of SLA lifecycle management from three steps to six steps. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
as an appropriate theoretical model for fine-grained, yet simplified and practical, monitoring of massive sets of SLAs. Using an actor system as the implementation basis, the whole SLA management can be efficiently parallelized. In this work, we use the actor system of the Akka framework [3] for JVM-based actors. Compared to other event-driven frameworks such as Esper [4] and Storm [5], actor systems build hierarchies of actors using strong parent–child relationships to achieve fault tolerance. In this setting, each actor may have one or more child actors that are responsible for their own specific functionality. When a failure occurs in one of the child actors, it is propagated upwards until it is handled by a predefined solution [6]. Our approach separates the agreement's fault-tolerance concerns and strategies, i.e. resource (re-)scheduling, outsourcing and (re-)negotiation, into multiple autonomous layers that can be hierarchically combined into an intuitive, parallelized, effective and efficient management structure. We show that this is a realistic approach for the automated management of the complete SLA lifecycle, including negotiation and provisioning, but also focus on monitoring as the driver of contemporary scalability requirements.

The remainder of this paper is structured as follows. In Section 2, the derivation and classification of SLA lifecycle management are discussed. In Section 3, we discuss prior art. The model of the problem is given in Section 4. In Section 5, a formal model of the SLA negotiation scenario is provided. Section 6 describes a formal model of the SLA monitoring scenario. In Section 7, we validate our proposal via discrete event simulations. Finally, we conclude the paper in Section 8.

2. SLA lifecycle management

SLAs relate by their very nature to different phases of a service lifecycle.
Therefore, it is of fundamental importance to agree on a common reference model for such a service lifecycle, which details the various phases, their stakeholders and their expected outputs in a common way without compromising the general design space for SLA management architectures. The overall lifecycle model is strongly influenced by, and broadly in line with, the ITIL framework [7]. In general, service lifecycle management considers six main phases: design and development, service offering, service negotiation, service provisioning, service operations and service decommissioning. SLAs also play a central role in the service lifecycle, because by capturing service expectations and entity responsibilities they drive both engineering decisions at conception level (for example during service design) and operational decisions (during service usage and delivery) [8]. Although there is no standard definition, in recent years various projects, research activities and companies have provided the foundation for the state of the art in SLA lifecycle management. For instance, Ron et al. define the SLA lifecycle in three phases [9]. Later on, more detailed classifications were proposed and outlined individually by the SLA@SOI project [10], by Sun Microsystems [11] and by the TeleManagement Forum [12]. In general, SLA lifecycle management consists of three phases, shown in corresponding colors at the top of Fig. 1, namely the creation (red), operation (orange) and removal (blue) phases, each of which can be further expanded into sub-phases of the same color. SLA creation includes three sub-steps, i.e. service provider discovery, SLA definition and SLA establishment. Once service providers are discovered, customers have to be aware of their detailed capacity. Therefore, the service providers describe and define their services properly and deliver the definitions of their services to the customers. Then, the customers establish the agreement(s) with one or more service providers, based on the service definition, through a process of SLA negotiation.

3. Related work

3.1. Service fault-tolerant monitoring

Currently, there are various service fault-tolerant monitoring tools, such as CloudWatch [13] from Amazon, which provides monitoring for AWS cloud resources and applications. Developers and system administrators can use it to collect and track metrics, gain insight and react immediately to keep their applications and businesses running smoothly. Likewise, by configuring Nagios [14], the components of an entire IT infrastructure can be monitored and traced, including various system metrics, applications, services and so on. In the end, Nagios can send alerts to administrators when critical infrastructure components fail. Similarly, through OpenTSDB [15], users are able to fetch historical data for many system metrics, such as CPU, memory and hard disk utilization. However, all of the above tools are designed only to report a historical record of outages or failures, which then have to be resolved manually. Eventually, in any business that depends on computers, creating an infrastructure with clear procedures and preventive maintenance can reduce the likelihood of failures. In other words, the question is how the system can understand the different types of possible failures and respond to potential violation alarms.
Therefore, recovery from failures must be planned, and an autonomous service-violation self-healing system is of paramount importance [16]. There are some related fault-tolerant SLA management frameworks: [17] presents a dependable QoS monitoring system called ''QoSMONaaS'', which relies on the as-a-Service paradigm and can thus be made available to virtually all cloud users in a seamless way; [18] introduces a framework, ''SLAMonADA'', for monitoring and explaining violations of WS-Agreement-compliant documents; and [19] introduces an SLA-aware service virtualization architecture that provides non-functional guarantees in the form of SLAs and consists of a three-layered infrastructure comprising agreement negotiation, service brokering and on-demand deployment. However, all of these works focus on monitoring systems that detect SLA violations and the corresponding penalty costs, without considering the reaction process needed to avoid the violations. Furthermore, many works on SLA monitoring, self-healing and autonomous adaptation have emerged, such as [20–26]. In detail, [20,21] employ the monitoring, analysis, planning and execution cycle to react against violations in
cloud computing. However, they apply centralized monitoring and reaction systems. A central system designed to monitor all SLAs and react to all detected violations can be expected to be hard to maintain and adjust, and may introduce too much overhead with respect to scalability and fault tolerance. In [22], the authors introduce a self-healing SLA management framework, where the related SLAs communicate with each other hierarchically. [23] presents a monitoring infrastructure, a generative approach for the adaptive multi-source monitoring of SLAs in service choreographies. More specifically, using an event-based monitoring framework (Ganglia), the authors introduce an online low-level infrastructure adaptation mechanism to avoid SLA violations. In [24], a self-adapting approach for Web-based applications is proposed that only rearranges the computational resources allocated to the service. In [25], PREvent, a framework for event-based monitoring, prediction of SLA violations and automated prevention of violations, is proposed. Similarly, Emeakaroha et al. [26] propose CASViD, an application-level SLA monitoring and detection framework. In these five papers, the monitoring activities are not centralized; rather, SLA and service are in a 1:1 relationship. However, the scope of SLA violation handling is limited to avoiding violation propagation between different domains, and the adaptation actions for violation handling are tedious. Besides, little attention has been paid to SLA management itself, e.g., how an SLA is negotiated, how monitoring is carried out, and how to classify and trigger the right events for handling violations. Some works [26,27] use the Java Messaging Service (JMS) as the communication mechanism together with ActiveMQ [28]; however, a JMS implementation requires an external broker, whereas Akka is able to run without one.
Besides, due to the lightweight structure of actors, an Akka application outperforms a JMS application in latency and throughput; for instance, a single actor occupies only about 600 bytes of memory, so 8 GB of memory can hold roughly 13 million actors [3].

3.2. Optimization strategies in SLA lifecycle management

Based on our proposed actor-based framework, corresponding strategies can be designed and applied in the different phases of the SLA lifecycle. For the SLA negotiation phase, [29] adopts concepts such as customer classes and price discrimination, and resource reservations with quality of service, integrating them with policies (SLAs being a form of those) and resulting in a job acceptance model that maximizes the provider's profit. The main difference to our approach is that we assume the customer requires the resources immediately; if we do not have enough, we try to find them from others and outsource so that we can serve the customer. Conversely, the authors of [29] perform full (local) resource scheduling. [30] uses application-specific performance models to achieve SLA compliance and optimal resource allocation. Similarly, [31] explores optimal application deployments so that SLAs for the application are not violated, focusing on service compositions and using a custom genetic algorithm. Conversely, we do not assume any knowledge about the executing application, and we are concerned with SLAs for the infrastructure service (i.e. the resources themselves). The customer already knows the amount of resources necessary, and we use that information to find the resources and satisfy the request. [32] proposes to use inventory control mechanisms to manage the computing capacity of an application service provider (ASP). Their model is restricted by a fixed upper bound on the available resources at an ASP. In contrast, our model relaxes that bound by considering outsourcing to other providers if the demand cannot be served locally.
Moreover, the RESERVOIR project [33] proposes an architecture for managing infrastructure-level SLAs and for federating cloud resources. In contrast, our work provides a sound model for establishing SLAs, including subcontracting. For the SLA monitoring phase, in
[34,35], the authors propose several VM consolidation strategies in CloudSim, i.e. resource management, either in simulation or as prototype implementations within common cloud middleware. Nevertheless, no SLA lifecycle management is introduced in this cloud simulation environment. Furthermore, the authors use VM live migration to improve consolidation and service performance; however, migration is applied without taking service availability into consideration. In this paper, we propose a proof-of-concept prototype of SLA lifecycle management by providing performance- and availability-oriented VM live migration and automatic cloud resource scalability. Last but not least, through our ''floating area'' concept for business modeling during SLA negotiation, an additional resource investment can be considered and applied in CloudSim in order to achieve fewer SLA violations while keeping the Return on Investment (ROI) at an acceptable level.

4. Decentralization of SLA management using actor system

Across all six steps of real-time SLA lifecycle management, by applying the actor system we mainly consider:
• Definition of the SLA: this step can be called the SLA negotiation step, in which the customer negotiates the required service with the service provider.
• SLA establishment, where an SLA template is established and filled in by a specific agreement, and the service provider starts to commit to the agreement.
• SLA monitoring and violation detection: while the service is consumed by users, it is also monitored by the service provider in order to detect violations, if any.

Specifically, within one JVM, an actor system [3] consists of many actor objects, in which threads are transparently encapsulated. An actor is an entity of computation that communicates with other actors exclusively through message passing. Usually, it encapsulates state and action. In technical programming, this means processing events and generating responses in an event-driven manner. Every time a message is processed, it is matched against the current action of the actor. Thus, an action is a function that defines the strategy to be taken in reaction to the incoming message. As we will argue in this work, this paradigm is much more suitable for real-time SLA management, because it enables the SLA manager to obtain the appropriate high-level view, which allows the management of complex real-time changes of SLA requirements. Moreover, in ITSM it frequently happens that one level is unable to resolve an incident in the first instance and has to turn to a specialist or superior who can make decisions outside the Service Desk's area of responsibility. This process is referred to as escalation [1]. The actor system matches this escalation strategy and is able to decentralize the whole SLA management, since it builds hierarchies of actors using strong parent–child relationships to resolve incidents. In this setting, each actor may have one or more child actors that are responsible for their own specific authority and functionality.
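The event-driven state/action processing described above can be sketched as follows. This is a minimal, single-threaded stand-in for an Akka actor, not the paper's implementation; the message and state names are illustrative:

```python
import queue

class Actor:
    """Minimal actor: encapsulates state and an action (message handler).

    A simplified, single-threaded stand-in for an Akka actor; real actors
    run concurrently and exchange messages asynchronously.
    """

    def __init__(self):
        self.mailbox = queue.Queue()
        self.state = "negotiation"

    def tell(self, message):
        # Fire-and-forget: enqueue the message without blocking the sender.
        self.mailbox.put(message)

    def run(self):
        # Event-driven loop: each message is matched against the
        # actor's current action (its receive behavior).
        while not self.mailbox.empty():
            self.receive(self.mailbox.get())

    def receive(self, message):
        if self.state == "negotiation" and message == "offer_accepted":
            self.state = "monitoring"
        elif self.state == "monitoring" and message == "sla_fulfilled":
            self.state = "terminated"

qos = Actor()
qos.tell("offer_accepted")
qos.tell("sla_fulfilled")
qos.run()
print(qos.state)  # terminated
```

The key property mirrored here is that state is never touched from outside the actor: the only way to influence it is to send a message.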
When a failure occurs in one of the child actors, it is propagated upwards until it is handled by a predefined solution [6]. Our approach separates the agreement's fault-tolerance concerns and strategies, i.e. resource (re-)scheduling, outsourcing and (re-)negotiation, into multiple autonomous layers that can be hierarchically combined into an intuitive, parallelized, effective and efficient management structure. Therefore, we propose the following (see Fig. 2):
• There is one actor system maintaining the whole SLA system; it includes one or more SLA management contexts.
Fig. 3. Finite state machine diagram of QoS actor.
Fig. 2. Decentralization of SLA management using actor system.
• Customers and SLA managers are in a many-to-many relationship.
• The whole SLA management task, including SLA negotiation, provisioning, monitoring and adjustment, is decentralized into hierarchical actors with states and actions, i.e. customer, SLA Manager (SLAM), SLA and QoS actors; each actor fulfills its own duty independently.
• An actor that is assigned to oversee a certain function in the program can split up its task into smaller, more manageable pieces until they become small enough to be handled. Each actor has exactly one supervisor, which is the actor that created it (blue arrow). Therefore, the construction of the actor system actually corresponds to the negotiation of the SLA. In the end, the whole set of SLA management actors is constructed in a top-down manner and shaped as a tree structure.
• During the SLA monitoring and violation detection phase, the corresponding predefined strategies can be performed in different layers of actors in the manner of hierarchical escalation. If one actor cannot deal with a certain situation, a corresponding exception message is forwarded to its supervisor in a bottom-up manner (red arrow).

We believe that via the integration of these basic real-time control mechanisms with the SLA-level policies for reacting to the messages inside an actor, more complex real-time requirements during SLA monitoring can be met with more flexibility and finer granularity [36]. As already introduced above, the actor system is designed in an event-driven manner: each actor object encapsulates state and action, i.e. the state of an actor is changed by certain actions. To facilitate the presentation, we illustrate each actor with a directed finite state diagram.

4.1. Modeling of Akka actors

We use a deterministic finite state machine to mathematically model the actors as a 5-tuple (A, S, ss, se, δ), where:
• A is a finite, non-empty set of input actions (a1, . . . , an).
• S is a finite, non-empty set of states (ss, s1, . . . , sn, se).
• ss is the initial state, an element of S (ss ∈ S).
• se is the end state, an element of S (se ∈ S).
• δ is the state-transition function, δ(sx, ax) → sy, where sx, sy ∈ S and ax ∈ A.
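The 5-tuple above can be rendered directly as a transition table. The sketch below uses the states and actions of the QoS actor (Fig. 3) as an example; δ is implemented as a dictionary, so it is a partial function, and undefined (state, action) pairs fail loudly:

```python
# The 5-tuple (A, S, ss, se, delta) as a transition table. States and
# actions follow the QoS actor of Fig. 3; the mapping is illustrative.
A = {"a1", "a2", "a3", "a4", "a5"}
S = {"ss", "s1", "s2", "se"}
ss, se = "ss", "se"
delta = {
    ("ss", "a1"): "s1",  # start negotiation
    ("s1", "a2"): "se",  # timeout or withdrawal
    ("s1", "a3"): "s2",  # offer accepted: negotiation -> monitoring
    ("s2", "a4"): "s2",  # violation handled by self-healing
    ("s2", "a5"): "se",  # SLA completely fulfilled
}

def run(actions, state=ss):
    """Apply delta for each input action; an undefined pair raises KeyError."""
    for a in actions:
        state = delta[(state, a)]
    return state

print(run(["a1", "a3", "a4", "a5"]))  # se
```

Because δ is deterministic, the resulting state of an actor is fully determined by the sequence of messages it has processed.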
In a finite state machine, each state is represented by a rectangular node. Edges with arrows show the transitions from one state to another, and each arrow is labeled with the input that triggers that transition. Inputs that keep the actor in the original state are represented by a circular arrow. The black node indicates the initial state, and the black node surrounded by a circle is the end state.

4.2. QoS actor finite state machine

As Fig. 3 illustrates, the state is changed by invoking a certain action, δ(ss, a1) → s1. During SLA negotiation, some non-functional properties of the service (i.e. the feasibility of the QoS) are evaluated. The QoS terms can vary depending on the domain; different domains may have different QoS metrics. Each SLA actor may contain one or more QoS actors. The negotiation can also be terminated (δ(s1, a2) → se) if the timeout expires or one of the two negotiating parties withdraws. Once the customer accepts the offer, the QoS actor turns its state from negotiation to monitoring (δ(s1, a3) → s2). In this state, the corresponding QoS is monitored. If at some point in time a violation is detected, the predefined self-healing action is activated in order to uphold the negotiated QoS (δ(s2, a4) → s2). Furthermore, when an SLA is completely fulfilled, its corresponding QoS actor(s) are automatically terminated (δ(s2, a5) → se). We do not define a ''violation'' state for the QoS actor. It might be the case that a QoS actor cannot handle the violation, for instance, if the VM is not reconfigurable or the migration is impossible. In that case, the QoS actor informs its parent actor, i.e. the SLA actor, via an exception message to fix it.

4.3. SLA actor finite state machine

In Fig. 4, once an SLA manager is established, one or more attendant SLAs are created accordingly and set in the negotiation phase (δ(ss, a1) → s1).
At this point, all the SLAs are just potential offers for the customer, and they are automatically turned into the frozen state (δ(s1, a3) → s2). In fact, there can be more than one potential solution that is suitable for both customer and service provider. Finally, an optimized offer is selected by the SLA manager and forwarded to the customer. Once the offer is accepted, the selected SLA is turned into the active state (δ(s2, a4) → s3) and all the other SLAs are kept in a list. While a service is successfully deployed, which means an SLA is running, violations might be detected by its child actor(s). In this case, the predefined self-healing action is activated in order to keep the SLA in the active state (δ(s3, a5) → s3). For instance, one of the frozen SLAs in the list is selected and turned active, while the violated SLA actor is frozen and put back into the list (δ(s3, a6) → s2). Naturally, the negotiation can be canceled by either party (δ(s1, a2) → se,
Fig. 4. Finite state machine diagram of SLA actor. Fig. 6. Finite state machine diagram of customer actor.
Fig. 5. Finite state machine diagram of SLA manager actor.
δ(s2, a8) → se, δ(s3, a7) → se) in any state of the SLA. Again, we do not define a ''violation'' state for the SLA actor. It might be the case that the SLA actor cannot handle the violation, for instance, if no alternative solution is available. In that case, the SLA actor informs its parent actor, i.e. the SLA manager actor, via an exception to handle it.

4.4. SLA manager actor finite state machine

Similarly, as Fig. 5 depicts, when the SLA negotiation starts (δ(ss, a1) → s1), it may last several rounds (δ(s1, a2) → s1), all of which may require certain iterations. At some point, a final agreement is reached (δ(s1, a4) → s2). The negotiation can also be terminated (δ(s1, a3) → se) if the timeout expires or one of the two negotiating parties withdraws. Furthermore, when an SLA is completely fulfilled, it is automatically terminated (δ(s2, a6) → se). Sometimes, an SLA violation can be avoided by the SLA manager in its execution state (δ(s2, a5) → s2). The SLA manager actor also has no ''violation'' state. It might be the case that the SLA manager actor cannot handle the violation, for instance, if no appropriate third party is available for outsourcing. In that case, the customer actor is informed in the end.

4.5. Customer actor finite state machine

In fact, a customer can be an organization, a human or even another SLA manager. To facilitate the simulation, we model the customer
as an actor (Fig. 6). Specifically, a customer initiates the SLA negotiation (δ(ss, a1) → s1). Two or more parties negotiate SLAs based on their individual goals. The negotiation may last several rounds (δ(s1, a2) → s1) of offers and counter-offers. The negotiation can also be terminated (δ(s1, a3) → se) if the timeout expires or one of the two negotiating parties withdraws. After the negotiation, an agreement can finally be reached (δ(s1, a4) → s2). Re-negotiating an SLA (δ(s2, a5) → s1) can modify its corresponding running service when the SLA is about to be violated. Furthermore, when an SLA is completely fulfilled, it is automatically terminated (δ(s2, a6) → se). In this model, the customer actor is the last station to which a violation exception is escalated. The service provider fulfills its obligation of compensation in light of the signed agreement (δ(s2, a7) → s3), and the SLA violation finally becomes inevitable if the customer refuses to renegotiate (δ(s3, a8) → se); only at this point do we say that a violation occurs.

4.6. Autonomous SLA violation monitoring and filtering

Based on the finite state machines in the sections above, an autonomous SLA violation-filtering framework is now presented. In Fig. 7, four modules are classified and defined individually in four layers. In this framework, each module has exactly one supervisor, which is the module that created it. If one module does not have the means for dealing with a certain situation, it sends a corresponding exception message to its supervisor, asking for help. The recursive structure then allows handling failures at the right level. Everything in this framework is designed to work in a completely distributed environment. Thus, all interactions between modules are pure message passing, exchanged in a non-blocking, asynchronous and highly performant event-driven manner.
In message-passing programming, an asynchronous call allows the caller to progress after a finite number of steps, and the completion of the method may be signaled via some additional mechanism (such as a registered callback or a message). Furthermore, non-blocking means that no thread is able to indefinitely delay others. Non-blocking operations are usually preferred to blocking ones, as the overall progress of a system is not trivially guaranteed when it contains blocking operations [3]. Therefore, SLA violations can be eliminated efficiently.
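The bottom-up escalation through the four layers can be sketched as a chain of handlers, each trying its own recovery strategy before forwarding the exception to its supervisor. The module names mirror Fig. 7; the violation names and per-layer strategies are illustrative placeholders:

```python
class Module:
    """One layer of the violation-filtering hierarchy.

    Each module knows the violations it can repair locally; anything
    else is escalated to its supervisor, mirroring Akka's parent-child
    supervision. Violation and strategy names are illustrative.
    """

    def __init__(self, name, can_fix, supervisor=None):
        self.name = name
        self.can_fix = set(can_fix)
        self.supervisor = supervisor

    def handle(self, violation):
        if violation in self.can_fix:
            return f"{self.name} resolved {violation}"
        if self.supervisor is None:
            return f"{violation} reached customer: renegotiate or pay penalty"
        # Escalate: forward the exception message one layer up.
        return self.supervisor.handle(violation)

customer = Module("customer", [])
slam = Module("SLA manager", ["capacity_shortage"], supervisor=customer)  # outsourcing
sla = Module("SLA", ["infeasible_vm_migration"], supervisor=slam)         # alternative solution
qos = Module("QoS", ["performance_drop"], supervisor=sla)                 # restart / add resources

print(qos.handle("performance_drop"))    # QoS resolved performance_drop
print(qos.handle("capacity_shortage"))   # SLA manager resolved capacity_shortage
```

Violations that no layer can repair reach the customer module, where renegotiation (or, ultimately, the penalty) applies.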
Fig. 7. Autonomous SLA violation filtering framework.
Fig. 8. Negotiation scenario.
Consequently, each module strives to resolve the (potential) violation at its own layer. At the very beginning, when a certain violation (e.g., a service performance issue) is detected by the QoS module, it tries to fix it by restarting the service or adding extra resources to the running service, without altering the content of the SLA. In this case, the customer does not notice any change and the running SLA remains valid. Otherwise, it sends an exception message to its supervisor (the SLA module). In the SLA module, the corresponding exception handlers have already been predefined, with all possible solutions in a list. Likewise, if the SLA module cannot find an alternative solution, it forwards the exception to its supervisor (the SLA manager module). The SLA manager module outsources to a third-party cloud to extend its capacity. Otherwise, a renegotiation can be initiated in the customer module so as to establish a new SLA without paying the penalty; there can also be several rounds of renegotiation within a certain time span. In the following sections, we explain how the actors interact with each other by considering different scenarios.

5. Modeling of negotiation scenario

In Fig. 8, the customer may initiate the SLA negotiation with the SLA manager by sending a pre-customized SLA template. An SLA template is a document that describes a set of options for each domain-specific service, including functional and non-functional properties (Step 1). In the IaaS scenario, the provider should check whether or not it is able to offer the required availability and/or performance for the specific service. For instance, based on the current load of a host, the feasibility of allocating a VM on the host
translates to whether the required number of Millions of Instructions Per Second (MIPS) is available or not. In case the host is almost overloaded, it should be ignored during the negotiation. When an SLA manager is established (Step 2), it splits the Planning and Optimization (PO) task(s) (Step 3) into one or more attendant SLA actor(s). According to the resource configuration, each SLA actor starts its evaluation simultaneously and finally comes up with a price quotation. IT service management in the broader sense overlaps with the disciplines of business service management and IT portfolio management, especially in the area of IT planning and financial control [1]. Namely, efficient IT business modeling, subject to IT resource management, is the chief task during SLA negotiation. First, let us assume resource types R1, R2, . . . , Rn, each bearing certain characteristics that refer to inherent resource details, e.g., the location of the resource, the performance of CPU cores, etc. Then, we model the cost of investment C^i for a solution involving resource Ri as the sum of the internal cost of investment C_I^i (i.e. resources utilized internally) and the external cost of investment C_E^i (i.e. sub-contracted resources, explained in detail in Section 6.3). C_I^i is modeled after the base cost per resource unit C_B^i for a solution given standard (baseline) QoS. C_B^i is multiplied by a factor σ^i that models the cost of increased QoS as given within the quote/SLA request by means of conditions Q_1^i, Q_2^i, . . . , Q_{r_i}^i for the qualitative characteristics of the resources (Eq. (1)), applying price reductions for bulk purchases (i.e. large values of the resource quantity N^i) (Eq. (2)). Finally, the total cost of investment C is the sum of the costs for all resources (Eq. (3)).
F* represents a minimum acceptable profit, which depends on the customer's profile, and P_V^{i*} represents a maximum acceptable failure probability, which may also be associated with specific customers or other business conditions at the time of negotiation. The problem we are trying to solve in PO is to minimize cost while the profit and the probability of failure remain within acceptable limits as dictated by high-level business rules (Eqs. (4) and (5)).
$\sigma^i = [1 - d^i(N^i)] \cdot f^i(Q_1^i, Q_2^i, \ldots, Q_{r_i}^i)$    (1)

$C_I^i = N^i \cdot \sigma^i \cdot C_B^i$    (2)

$C = \sum_i \left[ (1 + \beta^i) \cdot C_I^i + C_E^i \right]$    (3)

$g^i(C_I^i, C_E^i) - \beta^i \cdot C_I^i \ge F^*$    (4)
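As a concrete illustration, the cost model of Eqs. (1)–(3) and the profit constraint of Eq. (4) can be sketched in Python. The discount function $d$, the QoS factor $f$ and all numeric values below are illustrative assumptions of ours, not the paper's actual calibration:

```python
# Sketch of the cost model in Eqs. (1)-(3); the discount and QoS-factor
# functions below are illustrative placeholders, not the paper's own.

def qos_factor(qos_conditions):
    # f^i: cost multiplier for increased QoS; here each condition adds 10%.
    return 1.0 + 0.1 * sum(qos_conditions)

def bulk_discount(n):
    # d^i(N^i): price reduction for bulk purchases, capped at 20%.
    return min(0.2, 0.01 * (n // 10))

def internal_cost(n, base_cost, qos_conditions):
    # Eq. (1): sigma^i = [1 - d^i(N^i)] * f^i(Q_1^i, ..., Q_ri^i)
    sigma = (1.0 - bulk_discount(n)) * qos_factor(qos_conditions)
    # Eq. (2): C_I^i = N^i * sigma^i * C_B^i
    return n * sigma * base_cost

def total_cost(resources):
    # Eq. (3): C = sum_i [(1 + beta^i) * C_I^i + C_E^i]
    return sum((1.0 + r["beta"]) * r["ci"] + r["ce"] for r in resources)

ci = internal_cost(n=100, base_cost=2.0, qos_conditions=[1, 1])
print(total_cost([{"beta": 0.05, "ci": ci, "ce": 30.0}]))
```

With these placeholder functions, 100 units at a base cost of 2.0 and two QoS conditions yield an internal cost of 216.0, and a 5% extra-investment factor plus 30.0 of external cost give a total of roughly 256.8.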
Fig. 9. Function h in relation to β values.

Fig. 10. $\beta_{max}$ for gaining enough profits and $\beta_{min}$ for keeping the failure rate low. (a) Without intersection area, (b) with intersection (floating) area.

$h^i(\beta^i, T^i) \le P_V^{i*}$    (5)

$0 \le \beta^i, \quad \forall i.$    (6)

Eq. (4) can be transformed into Eq. (7):

$\beta_{max} \le \dfrac{F - F \cdot F^*}{C}.$    (7)

We can further transform Eq. (5) as:

$\beta_{min} \ge \dfrac{(T_{self} - T_{max}) \cdot \beta_{limit}}{T_{self} - T_{min}}.$    (8)

Specifically, Eq. (8) takes values between the current (statistical) failure rate $T_{self}$ and a very small value (for instance $10^{-2}$), which becomes effective when $\beta$ reaches the value $\beta_{limit}$ (Fig. 9). In other words, values of $\beta$ larger than $\beta_{limit}$ (i.e., more than the normal implementation costs) make no difference to the probability of failure. Therefore, if there is no intersection between Eqs. (7) and (8), as Fig. 10(a) illustrates, the service provider cannot offer the service with an acceptable failure rate and profit; in this case, the provider either has to renegotiate with the customer for a new service or refuse the incoming request. If, on the other hand, such an intersection exists (Fig. 10(b)), we call it the floating area. According to our strategy, the service provider selects the value $\beta_{min}$ for the sake of maximum profit and an acceptable failure rate. Consequently, the whole problem boils down to maximizing Eq. (10) in order to obtain the optimal ROI. In the ROI model, besides implementation cost and profit, the service provider is responsible for paying compensation, i.e., a penalty, in case the service is violated. This influences the implementation cost and should be treated as part of the investment cost. Thus, we can further extend the ROI as Eq. (9) for resource $i$ and as Eq. (10) for all resources:

$ROI^i = \dfrac{g^i(C_I^i, C_E^i) - \beta^i \cdot C_I^i}{(1 + \beta^i) \cdot C_I^i + C_E^i + P^i(QoS_1^i, \ldots, QoS_t^i)}$    (9)

$ROI = \dfrac{F}{\sum_i \left[ C^i \, f^i(Q_1^i, Q_2^i, \ldots, Q_{r_i}^i) + P^i \right]} = \dfrac{F}{C + P}.$    (10)

After the SLA manager has collected all the quotations from the SLA actor(s), it selects the optimal one and forwards it to the customer (Step 4). The latter may refuse the counter-offer (Step 5). Otherwise, if the offer is accepted (Step 6), this SLA is turned active and all the other SLAs are kept as part of the selected SLA (Step 7). The finally selected SLA creates its own QoS actors accordingly (Step 8).

6. Modeling of service monitoring and fault tolerance scenario

6.1. Violation handling in QoS module

Fig. 11. Violation handling in QoS module.

Two QoS terms are considered, namely service availability and service performance subject to MIPS. The frequency of the servers' CPUs can be mapped into MIPS ratings. Server CPU utilization is formulated as:

$\text{Server Utilization} = \dfrac{MIPS_{utilized}}{MIPS_{total}}$    (11)

where $MIPS_{utilized}$ is the utilized MIPS of the server and $MIPS_{total}$ is the capacity of the server. For instance, an HP ProLiant ML110 G5 server has 2 cores, each rated at 2660 MIPS, i.e., 5320 MIPS capacity per server. Once several VMs have been allocated to the server, the total utilized MIPS at a certain point in time may be 5317; the server utilization is then 99.94% and we say the server is over-utilized. We define that an SLA performance violation occurs when a VM cannot get the requested amount of MIPS. This can happen when VMs sharing the same host require higher CPU performance than can be provided due to VM consolidation [35]. Through Nagios we can monitor real-time service availability. If a violation is detected, instead of automatically sending an alert email to an administrator to be handled manually, the QoS actor receives the exception and, according to its cause, triggers the corresponding action to resolve the issue. We propose that the QoS actor automatically restarts the VM and reconnects to it within a certain timeout (Step 1 in Fig. 11); this operation is repeated at most 3 times (Steps 2, 4, 5, 6). For service performance, we feed the real-time resource utilization workloads from Nagios into CloudSim and apply its available resource allocation strategies. Once a host cannot provide enough resources to a VM (Step 7), i.e., fewer MIPS are allocated to the VM than expected, the QoS actor tries to reconfigure the VM with a new, appropriate CPU flavor (Steps 8, 9), which also depends on the current utilization of the host. Finally, the newly configured VM is assigned to the context of this SLA (Step 10).

Fig. 12. Violation handling in SLA module.
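The QoS-layer handling just described (restart and reconnect at most three times; on a performance violation, pick a larger CPU flavor that still fits on the host; otherwise escalate to the SLA layer) can be sketched as follows. The function names, the flavor list and the escalation exception are our own illustrative choices, not the actual Akka/Nagios implementation:

```python
# Sketch of the QoS-actor violation handling (Fig. 11); names and the
# monitoring callbacks are illustrative, not the paper's actual API.

MAX_RESTARTS = 3

class EscalateToSla(Exception):
    """Raised when the QoS layer cannot resolve the violation itself."""

def handle_availability_violation(vm, restart, is_reachable):
    # Steps 1-6: restart and reconnect at most three times.
    for attempt in range(MAX_RESTARTS):
        restart(vm)
        if is_reachable(vm):
            return f"recovered after {attempt + 1} restart(s)"
    raise EscalateToSla(f"{vm} still unavailable after {MAX_RESTARTS} restarts")

def handle_performance_violation(vm, requested_mips, host_free_mips, flavors):
    # Steps 7-10: pick the smallest CPU flavor that satisfies the request
    # and still fits on the current host; otherwise escalate (migration).
    for flavor_mips in sorted(flavors):
        if requested_mips <= flavor_mips <= host_free_mips:
            return flavor_mips
    raise EscalateToSla(f"{vm}: host cannot supply {requested_mips} MIPS")

print(handle_performance_violation("vm1", 600, 800, [250, 500, 750, 1000]))
```

For a VM requesting 600 MIPS on a host with 800 free MIPS, the sketch selects the 750 MIPS flavor; if no flavor fits, the exception models the escalation to the SLA module of Section 6.2.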
6.2. Violation handling in SLA module

If the service is still unavailable after three reconnection attempts, the service is migrated to another suitable host (Steps 1, 2 in Fig. 12). Similarly, when no additional resources can be allocated, VM live migration is applied (Steps 4, 5). In our scenario, two types of VM are selected for migration: we migrate a VM that is unavailable (e.g., due to a server outage) in order to make it available again, and we migrate a VM with fewer allocated MIPS than required in order to supply it with enough CPU power. Finally, the new VM is assigned to the context of this SLA (Steps 3, 6). For VM migration, once we have decided which VM to migrate, the remaining problem is to find a suitable server to host it, considering energy consumption, SLA failure rate and so on. If we manage to find such a host for the migrated VM, the SLA violation with respect to availability and performance can be resolved at this layer. However, short service downtimes are unavoidable during VM live migration due to the overhead of moving running VMs. The respective service interruptions in IaaS reduce the overall service availability, and it is possible that the customer's expectations regarding responsiveness are not met. Thus, it may happen that a VM cannot be migrated any more; in such a situation, the SLA reports an exception to its supervisor, the SLA manager module.

6.3. Violation handling in SLA manager module

SLA violations, if any, can be known in advance so that companies can take action to ensure their online reputation is not adversely affected, for instance by migrating the violated service to another host that has more available capacity, or simply by compensating the customer. Therefore, we define:

$Fx_j^i = \dfrac{\text{resolved SLAs}}{\text{violated SLAs}}$    (12)

where $Fx_j^i$ is the historical SLA violation fixing rate. The higher the value of $Fx_j^i$, the better the after-sales service that the provider offers. Thus, we propose a three-dimensional decision-making space:

$\text{Service Provider Selection} = \left( T_j^i,\ 1 - Fx_j^i,\ C_j^i \right)$    (13)

where $C_j^i$ is the price per resource unit of provider $j$ and $T_j^i$ is its historical failure rate. We can further formalize the distance from the ideal point by weighting each metric with a quantification variable:

$D = \sqrt{(\theta \cdot T_j^i)^2 + (\lambda \cdot (1 - Fx_j^i))^2 + (\sigma \cdot C_j^i)^2}.$    (14)

Fig. 13. Violation handling in SLA manager module.
For each provider in S, we compute its distance from the ideal point. After we choose the closest one, we strive to establish an agreement for a quantity that respects its declared capacity. If there is more than one provider with the same (smallest) distance, we choose either the cheapest per unit or the one with the largest capacity, depending on what is least expensive for the total quantity. In the extreme case that costs are the same, we choose one at random. If at the end of this process there are still unassigned resources, we remove the currently utilized subcontractors from S and repeat with the new excess amount until there are no unassigned resources. The process is executed for all resource types for which we do not have enough local capacity. The weights satisfy $\theta + \lambda + \sigma = 1$. The point $(0, 0, 0)$ represents the ideal service, with no SLA violations and a price of zero. Weights of $(1, 0, 0)$ mean that customers are only concerned about the failure rate; similarly, $(0.5, 0.3, 0.2)$ means that the failure rate is more important than the other facets. Thus, service provider selection can be customized by assigning weights to the metrics of concern. In the actor system, the remote feature can be used to extend the capacity of SLA management (Step 1). In our proposal, one JVM includes one or more SLA managers; actors in different JVMs communicate with each other by sending messages. Remote actor references represent actors that are reachable via remote communication, i.e., messages sent to them are transparently serialized and delivered to the remote JVM. This feature can be seamlessly applied to resource outsourcing strategies. Each JVM therefore represents a data center (simulated by CloudSim) located in a different place. All the data centers communicate (start SLA negotiation) via the Internet (Step 2) in order to extend their resource capacity (Fig. 13).
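The weighted distance of Eq. (14) and the greedy subcontracting loop described above can be sketched as follows. The provider records and weights are made-up examples, and the tie-breaking rules (cheapest per unit, largest capacity, random) are omitted for brevity:

```python
import math

# Sketch of the three-dimensional provider selection (Eqs. (13)-(14)) and
# the greedy subcontracting loop; provider data is illustrative only.

def distance(provider, theta, lam, sigma):
    # Eq. (14): distance from the ideal point (0, 0, 0).
    return math.sqrt((theta * provider["failure_rate"]) ** 2
                     + (lam * (1.0 - provider["fixing_rate"])) ** 2
                     + (sigma * provider["unit_cost"]) ** 2)

def subcontract(amount, providers, theta=0.5, lam=0.3, sigma=0.2):
    # Repeatedly pick the provider closest to the ideal point and assign as
    # much of the excess as its declared capacity allows.
    plan, pool = [], list(providers)
    while amount > 0 and pool:
        best = min(pool, key=lambda p: distance(p, theta, lam, sigma))
        share = min(amount, best["capacity"])
        plan.append((best["name"], share))
        amount -= share
        pool.remove(best)   # already-utilized subcontractors leave the pool
    return plan

providers = [
    {"name": "A", "failure_rate": 0.10, "fixing_rate": 0.9,
     "unit_cost": 0.5, "capacity": 300},
    {"name": "B", "failure_rate": 0.05, "fixing_rate": 0.8,
     "unit_cost": 0.4, "capacity": 200},
]
print(subcontract(400, providers))
```

In this toy instance, provider B is closer to the ideal point, so its 200 units are exhausted first and the remaining 200 units are assigned to A.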
If more than one party offers the same service, we apply the three-dimensional decision-making strategy and finally one of them is selected (Step 3).

6.4. Violation handling in customer module

As Fig. 14 illustrates, an SLA violation can be removed by starting a re-negotiation, so as to ensure the online reputation of
service providers. There is a need to add an SLA renegotiation process to SLA management for cloud-based systems. This allows customers and providers to initiate changes to an established agreement; for instance, storage may need to be resized as data grows excessively, or bandwidth adjusted as the channel capacity tends to be reduced during certain periods of time. It is evident that changing circumstances lead to the development of SLA renegotiation [37]. Both parties to the contract should be allowed to initiate renegotiation, and both should be allowed to cancel an existing contract, as this is what is allowed in the real world, after all. The initiation of re-negotiation should be possible through non-binding enquiries to the other party, so that an estimate of how much it would cost to change the contract can be obtained before committing to a new contract [38].

Fig. 14. Violation handling in customer module.

7. Experimental evaluation
To validate our proposed QoS model and its interrelationship with the business strategies of Sections 5 and 6, we choose CloudSim as our simulation platform, which is able to model and trace energy consumption and SLA performance with automatic host over/under-loading detection; this reduces the preparation of our simulation work mainly to the decision making on SLA violations and the corresponding adaptation [34]. We simulated a data center with 120 hosts. Each host includes 4 physical CPUs (pCPUs), each core with a performance of 2000 MIPS, 8 GB RAM and 1 TB storage. Virtualization technology allows us to define more virtual CPUs (vCPUs) than the available pCPUs; we allocate vCPUs to the VMs. While virtualizing, the expectation is that when one of the VMs requires increased processing resources, several others on the same host will be idle, i.e., all the VMs share the MIPS of all pCPUs of a host. Therefore, the total number of vCPUs across all VMs on a host is potentially greater than the number of pCPUs. We assume a vCPU to pCPU allocation ratio of 1.25:1; in total, there are thus 600 vCPUs available. Based on this cloud infrastructure, we import real workloads (CPU utilization) from GWDG in Germany into CloudSim. In this data center, 81 VMs are created. The whole simulation time is 24 h (86 400 s). Once the cloud environment in CloudSim is set up, it automatically allocates the workloads to the VMs. The interval of utilization measurements is 5 min. There are four types of VM, as presented in Table 1.

Table 1
VM instance flavors subject to CPU capacity, cost and profit.

Name of instance        Capacity     Cost (€)   Profit (€)
Micro CPU instance      250 MIPS     4          1.2
Small CPU instance      500 MIPS     6          1.8
Medium CPU instance     750 MIPS     10         3
High CPU instance       1000 MIPS    15         4.5

As already defined in CloudSim, the SLA violation Time per Active Host (SLATAH) is the percentage of time during which active hosts have experienced a CPU utilization of 100%, and the Performance Degradation due to Migrations (PDM) is the overall performance degradation caused by VM migrations. As such, the SLA performance violation is [35]:

$SLAV = SLATAH \cdot PDM.$    (15)

Fig. 15. (a) Energy consumption, (b) SLA average violation, (c) Relationship between maximum beta and minimum beta, (d) Return on investment.

Fig. 16. (a–c) Optimal target host utilization for allocating the migrated VM(s), (d) Comparison with other strategies of CloudSim in energy consumption and SLA violation.
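For reference, the floating-area check of Eqs. (7) and (8) that underlies the β values examined in the experiments can be sketched as follows. All numeric inputs are illustrative assumptions, not the experiment's parameters:

```python
# Sketch of the beta selection from Section 5 (Eqs. (7)-(8)); all numbers
# below are illustrative, not the experiment's actual calibration.

def beta_max(profit, min_profit_rate, cost):
    # Eq. (7): beta_max <= (F - F * F*) / C
    return profit * (1.0 - min_profit_rate) / cost

def beta_min(t_self, t_max, t_min, beta_limit):
    # Eq. (8): beta_min >= (T_self - T_max) * beta_limit / (T_self - T_min)
    return (t_self - t_max) * beta_limit / (t_self - t_min)

def choose_beta(profit, min_profit_rate, cost, t_self, t_max, t_min, beta_limit):
    hi = beta_max(profit, min_profit_rate, cost)
    lo = beta_min(t_self, t_max, t_min, beta_limit)
    if lo > hi:
        return None   # no floating area: renegotiate or refuse the request
    return lo         # beta_min maximizes profit at an acceptable failure rate

print(choose_beta(profit=30.0, min_profit_rate=0.7, cost=100.0,
                  t_self=0.2, t_max=0.15, t_min=0.01, beta_limit=0.3))
```

When the lower bound from the failure model exceeds the upper bound from the profit model, the two constraints have no intersection (Fig. 10(a)) and the provider must renegotiate or refuse; otherwise β_min is chosen, matching the strategy described in Section 5.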
As introduced in Section 5, when an intersection exists (Fig. 10(b)) we call it the floating area, and the service provider selects the value $\beta_{min}$ for the sake of maximum profit and an acceptable failure rate. However, we allow additional resources (and hence, lower profit) for customers, as we wish to improve the reputation of the provider. More specifically, when the service is violated or about to be violated, extra CPU MIPS, for instance, are allocated to the service in order to resolve the violation. This process is equivalent to moving the value of $\beta$ from $\beta_{min}$ towards $\beta_{max}$ within the floating area. In the experiment, our objective is to observe the changes in energy consumption, average SLA violation, the chosen value of $\beta$, as well as ROI, when investing additional CPU resources from 0 to 1000 MIPS in increments of 100 MIPS. By default, 0 additional MIPS is set in CloudSim. We have three customer classes, namely Gold, Silver and Bronze. For Gold customers, $F^*$ is $0.7 \cdot g(C_I, C_E)$; for Silver it is $0.8 \cdot g(C_I, C_E)$; and for Bronze it is $0.9 \cdot g(C_I, C_E)$. We allow additional resources (and hence, lower profit) even for Bronze customers, as we wish to improve the reputation of the provider. The maximum acceptable failure probability $T_{max}$ is 15%. The resource profit rate is 30%. The minimum profit rate $F^*$ depends on the customer. Our target is to reach a failure rate of 10% or less, given the small gradual effect of $\beta$ on future violations, and to achieve as much profit as possible. The results are shown in Fig. 15. We compare our approach with the strategy in CloudSim (0 additional resource investment): by investing more resources in the violated service, the total energy consumption increases progressively (Fig. 15(a)) and the average SLA violation decreases dramatically (Fig. 15(b)). When we consider the profit and failure models, Fig. 15(c) implies that by investing more than 500 MIPS, $\beta_{min}$ becomes greater than $\beta_{max}$, meaning we cannot achieve sufficient profit and a reasonably low SLA violation at the same time. Similarly, investing more resources leads to fewer penalties, but an additional implementation cost is inevitably introduced; if this investment exceeds 650 MIPS (Fig. 15(d)), the ROI drops below the original ROI (41.8%), meaning the additional investment brings no further benefit once implementation cost, profit and penalties are taken into consideration. In this scenario, an extra investment of 100 MIPS leads to the optimal result: the average SLA violation rate decreases to 10.5%, with a decrease in ROI of 0.68% and an increase in energy consumption of 4.09%.

The resource utilization of each individual service varies with the time of day. For instance, the peak access flow may happen around noon, while less resource utilization is needed during midnight. Thus, it may be the case that the service provider cannot afford as much Internet access as promised in the SLA with respect to service performance, such as response time. Therefore, in the QoS module, if we cannot allocate extra resources to the service, or the service is still unavailable after reconnection, VM live migration is applied in the SLA module. In the SLA module, we strive to find the optimal destination host for allocating the migrated VM(s). The host utilization between 0% and 100% is divided equally into 200 intervals. By default, 80% is the host threshold utilization predefined in CloudSim. As illustrated in Fig. 16(a)–(c), migrating a VM to a host whose utilization is closest to the threshold utilization leads to the least energy consumption and SLA performance violation, because otherwise some VMs from the source host would be migrated back and forth until one of them cannot be moved any more. This not only increases the number of VM migrations unnecessarily and blocks reasonable future migrations due to the availability constraint, but also introduces further energy consumption. In Fig. 16(d), our approach, i.e. AVL/AV, consumes 54.22 kWh of energy, whereas the other approaches in CloudSim consume around 37–51 kWh but incur a higher SLA performance violation.

Fig. 17. Experiment results in service outsourcing.
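The migration-target policy just described (among hosts that can still accommodate the VM, choose the one whose utilization after placement is closest to the threshold) can be sketched as follows. The host records and the threshold handling are illustrative; this omits CloudSim's actual allocation interfaces:

```python
# Sketch of the migration-target selection: prefer the host whose projected
# utilization is closest to the 80% threshold. Host data is illustrative.

THRESHOLD = 0.80

def pick_target_host(vm_mips, hosts):
    candidates = []
    for host in hosts:
        projected = (host["used_mips"] + vm_mips) / host["total_mips"]
        if projected <= 1.0:                      # host must not overflow
            candidates.append((abs(projected - THRESHOLD), host["name"]))
    if not candidates:
        return None                               # escalate: no suitable host
    return min(candidates)[1]

hosts = [
    {"name": "h1", "used_mips": 2000, "total_mips": 8000},
    {"name": "h2", "used_mips": 5500, "total_mips": 8000},
    {"name": "h3", "used_mips": 7800, "total_mips": 8000},
]
print(pick_target_host(500, hosts))
```

For a 500 MIPS VM, host h3 would overflow and is excluded; h2's projected utilization of 75% is closer to the 80% threshold than h1's 31%, so h2 is chosen, which mirrors the behavior observed in Fig. 16(a)–(c).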
In the SLA manager module, the ultimate goal of SLA outsourcing is to extend the capacity of the main service provider temporarily without paying service penalties due to service violations, which is currently not supported in CloudSim. To the best of our knowledge, no prior work has provided a consistent methodology for selecting infrastructure subcontractors and generating requests
to them. Here, third parties play a particularly important role in decision making for the intermediate SLA(s), which are not transparent to the end user of the main service provider. In our simulation, we set up four different data centers in different JVMs: the main site with 600 vCPUs, the 2nd sub-contractor with 610 vCPUs, the 3rd sub-contractor with 650 vCPUs and the 4th sub-contractor with 950 vCPUs. Through the actor system, all the data centers communicate (start SLA negotiation) in order to extend their resource capacity. Each data center applies the same business strategies as introduced in Section 5. While a data center is running, the corresponding SLA violations, costs and SLA violation fixing rate are updated and published, which can further be applied to the service provider selection of the main site. To be specific, in case more than one party offers the same service, we apply the three-dimensional decision-making strategy and finally one of them is selected. Fig. 17 illustrates one of the most important results of the simulation. The four plots show the available resources over time for the main site, with 600 vCPUs at the very beginning, and for the three sub-contractors. We can see that as soon as the main site's resources become depleted, it is mostly the 3rd sub-contractor that is utilized, as it is the ''optimal'' one from a price and failure rate point of view. Subsequently, the 2nd and the 1st are utilized when the 3rd has no more available resources. SLA renegotiation in the customer module implies further round(s) of SLA negotiation, during which one or more terms of an established agreement may be changed. A successful renegotiation assesses the existing agreement and replaces it with a new one. The entire interaction is performed between customer and service provider.
For example, a user may need to change the amount of CPU time they originally agreed to purchase if they find their computational job more complex than originally thought, so that it requires more CPU time to complete [38]. On the other hand, the service provider may have to trigger SLA renegotiation with a new counter-offer in order to adapt to changes in its resource capacity. Modeling the behavior or preferences of the customer is out of the scope of this paper; in our implementation,
we treat SLA renegotiation as a new round of SLA negotiation, and we leave the decision to both customer and service provider. This gives both negotiating parties an opportunity to adjust to their new requirements or resource allocation schemes and, in the end, to avoid SLA violations and further penalties.

8. Conclusion and future work

In this paper, we propose a realistic, modern SLA management system based on actor systems as an appropriate theoretical model for fine-grained, yet simplified and practical, monitoring of massive sets of SLAs. We show that this is a realistic approach for the automated management of the complete SLA lifecycle, including negotiation and provisioning, but focus on monitoring as the driver of contemporary scalability requirements. Our approach builds hierarchies of actors with strong parent–child relationships, separating the agreement's fault-tolerance concerns into multiple autonomous layers that can be hierarchically combined into an intuitive, parallelized, effective and efficient management structure. In the future, we will apply the Akka system to multi-cloud structures, i.e., the relationship between IaaS, PaaS and SaaS; e.g., an SLA in SaaS will fail if its related SLA in PaaS is violated. The lack of an effective relation between dependent SLAs is thus a vital challenge that makes an SLA management system inefficient. By using the Akka system, the SLAs between domains can communicate with each other in terms of handling SLA translation, composition and violation. Thus, the difficulty of cross-domain SLA management can be alleviated.
References

[1] J. Bartlett, Best Practice for Service Delivery, The Stationery Office, 2007.
[2] K. Lu, T. Roeblitz, P. Chronz, C. Kotsokalis, SLA-based planning for multi-domain infrastructure as a service, in: Cloud Computing and Services Science, Springer Science+Business Media, New York, ISBN: 978-1-4614-2326-3, 2012.
[3] Akka documentation, ''Actor Systems'', 2014. Available: http://doc.akka.io/docs/akka/2.3.2/general/actor-systems.html [Online].
[4] ''Esper''. Available: http://esper.codehaus.org [Online].
[5] ''Storm''. Available: https://github.com/nathanmarz/storm [Online].
[6] ''Erlang programming language''. Available: http://www.erlang.org/doc.html [Online].
[7] ITIL (IT Infrastructure Library): ITIL Version 3—Service Improvement, 2014. Available: http://www.itil-officialsite.com [Online].
[8] Cloud computing service level agreements. Exploitation of Research Results, 2013. Available: http://ec.europa.eu/ [Online].
[9] S. Ron, P. Aliko, Service level agreements. Internet NG project, 2001.
[10] SLA@SOI, 2013. Available: http://sla-at-soi.eu/ [Online].
[11] L.L. Wu, R. Buyya, Service Level Agreement (SLA) in utility computing systems, Architecture, 27, 2010.
[12] SLA Management Handbook, Vol. 2, Concepts and Principles, Release 2.5, The TeleManagement Forum, Morristown, New Jersey, USA, 2005.
[13] CloudWatch, 2014. Available: http://aws.amazon.com/cloudwatch/ [Online].
[14] Nagios, IT Infrastructure Monitoring, 2014. Available: http://www.nagios.org [Online].
[15] OpenTSDB, The Scalable Time Series Database, 2014. Available: http://opentsdb.net [Online].
[16] P.S. Weygant, Basic high availability concepts, in: Clusters for High Availability: A Primer of HP Solutions, second ed., Prentice Hall, 2001, pp. 1–39. ISBN: 0-13-089355-2.
[17] G. Cicotti, S. D'Antonio, R. Cristaldi, A. Sergio, How to monitor QoS in cloud infrastructures: The QoSMONaaS approach, in: Intelligent Distributed Computing VI, Springer, 2013, pp. 253–262.
[18] C. Muller, M. Oriol, M. Rodriguez, X. Franch, J. Marco, M. Resinas, A. Ruiz-Cortes, SLAMonADA: A platform for monitoring and explaining violations of WS-agreement-compliant documents, in: Principles of Engineering Service Oriented Systems, PESOS, 2012, pp. 43–49.
[19] A. Kertesz, G. Kecskemeti, I. Brandic, Autonomic SLA-aware service virtualization for distributed systems, in: Parallel, Distributed and Network-Based Processing, PDP, 2011, pp. 503–510.
[20] A.L. Freitas, N. Parlavantzas, J.L. Pazat, A QoS assurance framework for distributed infrastructures, in: Proceedings of the 3rd International Workshop on Monitoring, Adaptation and Beyond, 2010, pp. 1–8.
[21] V.C. Emeakaroha, I. Brandic, M. Maurer, S. Dustdar, Low level metrics to high level SLAs—LoM2HiS framework: Bridging the gap between monitored metrics and SLA parameters in cloud environments, in: International Conference on High Performance Computing and Simulation, HPCS, 2010, pp. 48–54.
[22] A. Mosallanejad, R. Atan, M.A. Murad, R. Abdullah, A hierarchical self-healing SLA for cloud computing, Int. J. Digit. Inf. Wirel. Commun. 4 (1) (2014) 43–52.
[23] A. Bertolino, A. Calabro, G. De Angelis, A generative approach for the adaptive monitoring of SLA in service choreographies, in: 13th International Conference on Web Engineering, 2013, pp. 408–415.
[24] J.P. Magalhaes, L.M. Silva, A framework for self-healing and self-adaptation of cloud-hosted web-based applications, in: IEEE International Conference on Cloud Computing Technology and Science, 2013, pp. 555–565.
[25] P. Leitner, A. Michlmayr, F. Rosenberg, S. Dustdar, Monitoring, prediction and prevention of SLA violations in composite services, in: 2010 IEEE International Conference on Web Services, 2010, pp. 369–376.
[26] V.C. Emeakaroha, T.C. Ferreto, M.A.S. Netto, I. Brandic, C.A.F. De Rose, CASViD: Application level monitoring for SLA violation detection in clouds, in: Proceedings of the 36th IEEE International Conference on Computer Software and Applications, COMPSAC, 2012, pp. 499–508.
[27] D. Liu, U. Kanabar, C.H. Lung, A light weight SLA management infrastructure for cloud computing, in: 26th IEEE Canadian Conference of Electrical and Computer Engineering, CCECE, 2013.
[28] ''Apache ActiveMQ''. Available: http://activemq.apache.org/ [Online].
[29] T. Püschel, D. Neumann, Management of cloud infrastructures: Policy-based revenue optimization, in: International Conference on Information Systems, ICIS 2009, 2009, pp. 10–25.
[30] H.N. Van, F.D. Tran, J.M. Menaud, SLA-aware virtual resource management for cloud infrastructures, in: International Conference on Computer and Information Technology, IEEE Computer Society, 2009, pp. 357–362.
[31] H. Wada, J. Suzuki, K. Oba, DAML-QoS ontology for web services, in: Proceedings of the 2009 Congress on Services-I-Volume 00, IEEE Computer Society, 2009, pp. 661–669.
[32] J.L. Hellerstein, K. Katircioglu, M. Surendra, A framework for applying inventory control to capacity management for utility computing, in: Integrated Network Management, IM 2005, 2005, pp. 237–250.
[33] B. Rochwerger, D. Breitgand, E. Levy, A. Galis, K. Nagin, I. Llorente, R. Montero, Y. Wolfsthal, E. Elmroth, J. Caceres, M. Ben-Yehuda, W. Emmerich, F. Galan, The RESERVOIR model and architecture for open federated cloud computing, IBM J. Res. Dev. (2009).
[34] A. Beloglazov, J. Abawajy, R. Buyya, Energy-aware resource allocation heuristics for efficient management of data centers for cloud computing, Future Gener. Comput. Syst. 28 (5) (2012) 755–768.
[35] A. Beloglazov, R. Buyya, Optimal online deterministic algorithms and adaptive heuristics for energy and performance efficient dynamic consolidation of virtual machines in cloud data centers, Concurrency and Computation: Practice and Experience, Wiley Press, New York, USA, 2011, http://dx.doi.org/10.1002/cpe.1867. ISSN: 1532-0626.
[36] B. Nobakht, F.S. de Boer, M.M. Jaghoori, The future of a missed deadline, in: Coordination Models and Languages, in: Lecture Notes in Computer Science, vol. 7890, 2013, pp. 181–195.
[37] A.F.M. Hani, I.V. Paputungan, M.F. Hassan, Service level agreement renegotiation framework for trusted cloud-based system, in: Future Information Technology, in: Lecture Notes in Electrical Engineering, vol. 276, 2014, pp. 55–61.
[38] P.M. Hasselmeyer, P. Koller, P. Wieder, An SLA re-negotiation protocol, in: 2nd Non Functional Properties and Service Level Agreements in Service Oriented Computing Workshop, NFPSLA-SOC 2008, Ireland, 2008.
Kuan Lu received his master's degree in computer engineering from Duisburg-Essen University, Germany, and his doctoral degree in computer science from the University of Goettingen, Germany. He has published articles in several refereed international conferences and journals in the areas of planning and optimization for service level agreements in cloud computing, and resource scheduling and allocation management. He has also been actively involved in the EU FP7 projects PaaSage and SLA@SOI, and the Lotus Mashup and MILA projects at IBM Research and Development GmbH in Germany. Mr. Lu is now working as a software engineer for cloud lifecycle management at SAP SE in Germany.
Ramin Yahyapour is a full professor at the Georg-August University of Goettingen. He is also managing director of the GWDG, a joint compute and IT competence center of the university and the Max Planck Society. Dr. Yahyapour holds a doctoral degree in Electrical Engineering and his research interest lies in the area of efficient resource management and its application to service-oriented infrastructures, clouds, and data management. He is especially interested in data and computing services for eScience. He gives lectures on parallel processing systems, service computing, distributed systems, cloud computing, and grid technologies. He was and is active in several national and international research projects. Ramin Yahyapour serves regularly as a reviewer for funding agencies and as a consultant for IT organizations. He is an organizer and program committee member of conferences and workshops as well as a reviewer for journals.

Philipp Wieder is deputy leader of the data center of the GWDG at the University of Goettingen, Germany. He received his doctoral degree at TU Dortmund in Germany. He has been active in research on clouds, grids and service-oriented infrastructures for several years. His research interest lies in distributed systems, service level agreements and resource scheduling. He has been actively involved in the FP7 IP PaaSage, SLA@SOI and SLA4D-Grid projects.
Edwin Yaqub is a research associate at GWDG and Ph.D. candidate at the University of Goettingen, Germany. He holds a Master degree in Computer Science from RWTH Aachen University, Germany. His research interests include Distributed Systems and Cloud Computing with focus on automated SLA management, SLA negotiations, planning and optimizing resource allocation. He has been actively involved in the EU FP7 projects SLA@SOI and PaaSage. He has worked on diverse domains including health information systems at Philips Research Laboratories Aachen and cooperative vehicle systems at Siemens, Aachen.
Monir Abdullah is an associate professor at the Information Technology Department, Thamar University, Yemen. He has been a postdoctoral researcher at GWDG, University of Goettingen, Germany. He received his Ph.D. in parallel and distributed computing from Universiti Putra Malaysia, Malaysia, in 2009. He received the university excellence in teaching award at Multimedia University in 2009. His main research interests are in cloud/grid computing, resource scheduling, optimization and parallel computing. He has published articles in several refereed international conferences and journals.
Bernd Schloer received his master's degree in computational sciences from the University of Goettingen, Germany. His research interest lies in the areas of QoS monitoring and a prototype of the QoS Signalling Layer Protocol, with results published in international conferences. Currently he works as a system administrator at the University of Goettingen and at GWDG in Germany.
Constantinos (Costas) Kotsokalis received his electrical and computer engineering diploma from the National Technical University of Athens, Greece, and his computer science doctoral diploma from the Dortmund University of Technology, Germany. His expertise lies within the domain of distributed computing at large, with a focus mostly on service level agreements. Additional scientific topics of interest include distributed algorithms, job / resource scheduling and data management. Costas has worked in multiple international projects and diverse domains such as wide-area networking, grid / cloud computing, SLA management, remote instrumentation and e-Health.