Detection of transmissible service failure in distributed service-based systems


PII: S0743-7315(18)30178-3
DOI: https://doi.org/10.1016/j.jpdc.2018.03.005
Reference: YJPDC 3855

To appear in: J. Parallel Distrib. Comput.

Received date: 17 October 2017
Revised date: 26 December 2017
Accepted date: 10 March 2018

Please cite this article as: D. Ye, Q. He, Y. Wang, Y. Yang, Detection of transmissible service failure in distributed service-based systems, J. Parallel Distrib. Comput. (2018), https://doi.org/10.1016/j.jpdc.2018.03.005

Journal of Parallel and Distributed Computing 00 (2018) 1–15


Detection of Transmissible Service Failure in Distributed Service-based Systems

Dayong Ye 1, Qiang He 1,*, Yanchun Wang 1, Yun Yang 2,1,*

1 School of Software and Electrical Engineering, Swinburne University of Technology, VIC 3122, Australia
2 School of Computer Science and Technology, Anhui University, China

Abstract

Detection of service failure, also known as service monitoring, is an important research problem in distributed service-based systems (SBSs). Failure of services is a transmissible threat in distributed SBSs, because services in distributed SBSs may have dependency relationships among them, and thus the failure of one service may cause the failure of other services. Therefore, such transmissible service failure has to be detected in a timely manner while the corresponding resource consumption is kept as low as possible. Most existing service monitoring approaches are centralised; they suffer from the potential of a single point of failure and are not suitable for large-scale distributed SBSs. Moreover, these centralised approaches are designed only for single-tenant SBSs. Nowadays, the scale of distributed SBSs is extremely large, i.e., they include a large number of services and clients. Thus, it is essential for monitoring approaches to work well in large-scale distributed SBSs and to support multi-tenancy. Towards this end, in this paper, a novel agent-based decentralised service monitoring approach for distributed SBSs is developed. Compared to centralised approaches, the proposed decentralised approach can avoid the single point of failure and can balance the computation over the monitoring agents. Also, unlike existing approaches which consider only single tenancy, the proposed approach takes multi-tenancy in distributed SBSs into account. Experimental results demonstrate that the proposed approach can respond as quickly as centralised approaches with much less computation overhead.

© 2017 Published by Elsevier Ltd.

Keywords: Transmissible Service Failure, Service Monitoring, Distributed Service-based Systems, Multi-Agent Systems, Multi-Tenancy

1. Introduction

Distributed service-based systems (SBSs) are becoming prevalent in various domains, e.g., cloud computing [1, 2, 3], software systems [4, 5, 6, 7] and social networks [8]. These distributed SBSs are composed of various component services which can collectively offer qualified services to meet user requirements [9, 10, 11, 12]. However, such distributed SBSs usually operate in volatile and dynamic environments, where various anomalies or threats may cause transmissible service failure [13, 14, 15, 16, 17, 18], e.g., a computer virus attack causing unexpected cascading failures of services [19]. For example, suppose an SBS consists of three services, s1, s2 and s3, in a sequential order, where the input of s3 is the output of s2 and the input of s2 is the output of s1. Now if s1 is attacked by computer viruses and has errors in its output, then both s2 and s3 are affected. This is a transmissible service failure, which may cause the breakdown of an entire SBS [20, 21]. On the user side, when such anomalies occur, users may perceive that the response times of the services become longer or that the services are unavailable [2, 3, 22]. In order to guarantee the quality of services and satisfy user requirements, the SBSs must be monitored and transmissible service failure has to be detected in an efficient way [23, 24]. Then, in the case that anomalies are detected or predicted, adaptation actions can be taken to address these anomalies so as to maintain the quality of the SBSs [25, 26, 27]. A naive solution for service monitoring is to monitor all the component services of an SBS constantly. This solution, however, will incur large monitoring costs: resource cost and system cost [28].

Resource cost includes software cost, hardware cost and human labour cost [29]. For example, a service provider may maintain hundreds of thousands of services for its clients [30]. In this situation, constantly monitoring all the services incurs a large amount of resource cost and is sometimes even unfeasible. System cost refers to negative impacts on the quality of the monitored services and systems [4]. For example, in SBSs, monitoring component services sometimes involves retrieving logs of the behaviours of services and sniffing network traffic. Such operations may cause performance reductions of up to 70 percent [31, 32, 33], which significantly prolongs the response times of services and systems. Therefore, there should be a tradeoff between the benefit and cost of monitoring. More monitoring obtains more information about the monitored targets and thus achieves more accurate detection and prediction. However, more monitoring also incurs higher monitoring costs.

Recently, many service monitoring approaches have been proposed [34, 35, 28, 36, 37]. Most of these approaches are centralised, where monitoring strategies are made by central managers. Centralised approaches can be very efficient in small-scale systems but may be less efficient in large-scale distributed systems. For example, when a system has hundreds of thousands of services, conducting monitoring strategies imposes a heavy workload on the central manager, as the computation overhead is very large. In addition, centralised approaches have the potential of a single point of failure, which means that if the central manager breaks down, the monitoring of the system stops correspondingly and the system becomes vulnerable to anomalies. Although some existing service monitoring approaches are decentralised, they were developed in other fields, e.g., communication networks [38], rather than for distributed SBSs. Moreover, the existing approaches are developed for single-tenant environments and are not suitable for multi-tenant SBSs. Modern SBSs are large-scale, distributed and designed for multi-tenant environments. Thus, existing approaches may not be very efficient in modern SBSs. It is therefore significant to develop a decentralised, multi-tenant-aware service monitoring approach.

* Corresponding authors. Email addresses: [email protected], [email protected], [email protected], [email protected]
In this paper, a novel agent-based decentralised service monitoring approach is proposed. In the proposed approach, each service class in a distributed SBS is monitored by an agent, which autonomously and dynamically decides the monitoring frequency based on a multi-agent learning technique. The novelty of the proposed approach includes two aspects, which constitute the two-fold contribution of this paper.


1. Unlike existing centralised approaches, the proposed approach is decentralised, which avoids the single point of failure and works efficiently in distributed SBSs. This is achieved by enabling each individual agent to make monitoring strategies independently and autonomously.

2. Unlike existing approaches which consider only single tenancy, the proposed approach takes multi-tenancy into account. This makes the proposed approach suitable for multi-tenant SBSs.

Here, an agent is an independent and intelligent entity which is able to make rational decisions autonomously in a dynamic environment, namely being both proactive and reactive, showing rational commitment to decision making, and exhibiting flexibility when facing an uncertain and changing environment [39].

The rest of the paper is organised as follows. The next section reviews existing related studies and compares the proposed approach with them. In Section 3, a motivating example is given to demonstrate the importance and applicability of this study. Then, in Section 4, the details of the proposed approach are provided. Section 5 evaluates the proposed approach and shows its efficiency in comparison with two other approaches. Section 6 concludes the paper and outlines future work.

2. Related Work

Service monitoring in SBSs has attracted much attention in the past years. Many efforts have been made to develop service monitoring approaches [40, 41, 34, 35, 28]. Keller and Ludwig [40] developed a Web service level agreement (SLA) framework. The framework is composed of an efficient language, which is based on XML Schema, and a runtime architecture which consists of several SLA monitoring services. In their framework, the SLA monitoring services are used to ensure that an SLA is conducted properly, where two signatory parties and two supporting parties collaborate to monitor the SLA. Later, Raimondi et al. [42] implemented automatic monitors for Web services from SLAs by using an Eclipse plugin and Apache AXIS handlers. Foster and Spanoudakis [34] designed an approach to support dynamic configuration for allocating different monitoring components to SLAs. Specifically, Foster and Spanoudakis used a mechanical parser to decompose each SLA into manageable monitoring targets and then employed a selection mechanism to allocate monitoring components to the targets. The service monitoring studied in [40, 42, 34] focuses on ensuring no violation of SLAs, while our study aims at runtime service monitoring in order to detect service failure in a timely manner.

Robinson [43] integrated methods of requirement analysis and software execution monitoring to develop a monitoring framework. The resulting framework can assist the development of Web service requirement monitors. Robinson used a goal-based method to discover obstacles and to illustrate the derivation of assigned monitors from the obstacles. Then, the Web service monitors can be automatically derived from high-level requirement descriptions. Wang et al. [41] designed an online monitoring approach for Web service requirements. This approach involves a pattern-based specification of service constraints and a monitoring model. The monitoring model covers five kinds of system events: client request, service response, application, resource and management. The monitoring model also covers a monitoring framework which consists of probes and agents to collect the system events and sensitive data in the system. The monitoring framework then analyses the collected information against the service constraints to evaluate the behaviour of Web services so as to reveal any abnormality of Web services. In [43, 41], Web services are monitored at the requirement level in a centralised manner, where a monitoring framework is responsible for analysing and evaluating the behaviour of Web services. In this paper, we study service monitoring at the execution level in a decentralised manner.


Meng and Liu [35] introduced the concept of monitoring-as-a-service (MaaS). In their concept, MaaS can not only monitor system state and detect violations but also reduce monitoring cost, improve scalability and enhance the effectiveness of monitoring service consolidation and isolation. They developed three mechanisms to achieve these goals: 1) a window-based violation detection mechanism, 2) a violation-likelihood-based state monitoring mechanism and 3) a multi-tenant state monitoring mechanism. Multi-tenancy in [35] refers to the concurrent running of monitoring tasks, while in this paper, multi-tenancy refers to the ability to satisfy multiple clients simultaneously based on a single application instance [44, 45]. Calero and Aguado [46] presented a service monitoring architecture which offers a monitoring platform-as-a-service to each cloud consumer. Through this architecture, cloud providers have a complete overview of the architecture, while cloud consumers can access the monitoring information about the status of the architecture and can set their services or resources to be monitored. Then, services are logically grouped in a hierarchical way to be monitored. In their work, a complete view of the architecture has to be used by cloud providers, while in our approach, no global information is required. Zhang et al. [36] proposed a centralised service monitoring approach which assumes that some service parameters are inaccessible. Their approach then takes service dependencies into consideration. Thus, their approach can indirectly monitor inaccessible services by using information collected from other services. Specifically, their approach introduces a utility to correlate the inaccessible parameters of a dependent service with the related accessible parameters of antecedent services. Then, their approach adopts a multi-step utility computing procedure to infer the inaccessible parameters' values as the indirect monitoring result. Wang et al.
[37] proposed a centralised criticality-based monitoring approach to formulate cost-effective monitoring strategies. The criticality of a service is evaluated based on its impact on the quality of the SBS and the number of tenants that share the service. Then, the services with higher criticalities in an SBS are allocated more monitoring resources. Other works which focus on violation detection include [47, 48, 49]. The approaches and mechanisms developed in these works are distributed but still require global information or need coordinators to organise their execution. In addition, Zhang et al. [50] proposed a monitoring SaaS (software-as-a-service) model and then used this model to predict the available states of servers for dynamic task scheduling. Zhang et al.'s work focuses on task scheduling instead of monitoring approach design. Moreover, Chu et al. [51] presented topic models to discover service orchestration patterns from sparse service logs. They then employed an extended time-series form of the model to monitor the dynamics of emerging service orchestrations. Thus, Chu et al.'s work focuses on monitoring the emergence of service orchestrations instead of monitoring service failures. In [28], we proposed a monitoring approach, CriMon, which can generate cost-effective monitoring strategies by calculating the criticalities of the execution paths and the component services of an SBS. Then, based on these criticalities, optimal


monitoring strategies can be generated by modeling the optimisation problem as an integer programming (IP) problem and using existing tools to solve the IP problem. In this paper, we extend our previous work [28] by allocating each service class a monitoring agent and enabling each agent to make monitoring strategies autonomously and individually. Thus, the proposed approach is decentralised, which avoids the single point of failure. In addition, we also extend our previous work [28] to multi-tenant SBSs, as multi-tenancy is a critical feature of modern distributed SBSs.

3. A Motivation Example

[Figure 1. A motivation example: a travel-booking SBS with six tasks (T1: Airline ticket search, T2: Car rental, T3: Hotel search, T4: Train ticket search, T5: Cruise ticket search, T6: Insurance quote), three tenants (1: WebJet, 2: P&O Cruise, 3: Rail Plus) and a set of candidate services for each task (e.g., s1,1, s1,2, ...).]

Fig. 1 shows an example of a multi-tenant SBS that provides travel booking services. This example is from our previous work [12]. There are currently three tenants: 1) WebJet, a travel booking company offering flights, car rental and hotel bookings in Australia; 2) Rail Plus, the leading dedicated rail trip specialist general sales agency throughout Australia & New Zealand; and 3) P&O Cruise, Australia and New Zealand's leading cruise line. In the future, new tenants may come and existing tenants may leave. This SBS serves tenants, i.e., travel agencies, by processing their users' requests. A user enters her/his travel requirements, e.g., city of departure, destination, departure date, return date, preferred type of rental car, etc. In response to the request, the SBS returns a list of candidate travel plans for the user to book. The functionality of this SBS is represented as a business process that includes six tasks (T1, ..., T6). For users of different tenants, the SBS performs different tasks to generate travel plans. For example, for WebJet's users, the Airline Ticket Search, Car Rental, Hotel Search and Insurance Quote tasks are performed. For Rail Plus's users, the Train Ticket Search and Hotel Search tasks are performed. For P&O Cruise's users, the Cruise Ticket Search and Insurance Quote tasks are performed. Users from different travel agencies share not all but only some of the tasks. In this SBS, the Insurance Quote task is performed for users from both WebJet and P&O Cruise, while the Hotel Search task is performed for users from both WebJet and Rail Plus. To ensure that the SBS persistently satisfies users' requirements, services have to be monitored to detect potential


anomalies. Also, because most of the tasks have dependency relationships, the failure of one service may cause the failure of other services. Thus, the detection of such transmissible failure must be conducted in a very efficient way. For example, for the tenant WebJet, four tasks need to be performed: Airline Ticket Search, Car Rental, Hotel Search and Insurance Quote. If the service for Airline Ticket Search has errors in its output, e.g., a wrong date and time, then the services for Car Rental and Hotel Search will also be affected. For each task, there is a set of candidate services. As the numbers of tenants and candidate services can be very large in a distributed multi-tenant SBS [12], it is unfeasible to constantly monitor all the services, especially in a centralised way. Thus, an efficient decentralised monitoring approach, which also supports multi-tenancy, is necessary. In this paper, an agent-based decentralised approach is proposed, where each task in an SBS is equipped with an agent for monitoring. Then, at the beginning of each time slot, an agent autonomously decides whether to monitor the task in that time slot. Generally, a task which is shared by more tenants is more important than tasks which are shared by fewer tenants, because if this task fails, more tenants will be affected. Thus, such a task should be monitored more often.
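To make the sharing-based importance idea concrete, the following small sketch ranks tasks by the number of tenants that share them. The task-to-tenant assignments follow the Fig. 1 example; the linear mapping from the number of sharing tenants to a monitoring probability is an assumption of this sketch, not the paper's formula.

```python
# Illustrative sketch: tasks shared by more tenants receive a higher
# monitoring probability. Assignments mirror the Fig. 1 example.
TASK_TENANTS = {
    "T1_airline_search":  {"WebJet"},
    "T2_car_rental":      {"WebJet"},
    "T3_hotel_search":    {"WebJet", "RailPlus"},
    "T4_train_search":    {"RailPlus"},
    "T5_cruise_search":   {"POCruise"},
    "T6_insurance_quote": {"WebJet", "POCruise"},
}

def monitoring_priority(task_tenants, base=0.2, per_tenant=0.2):
    """Map each task to a monitoring probability that grows linearly with
    the number of tenants sharing it (illustrative rule, capped at 1.0)."""
    return {task: min(1.0, base + per_tenant * len(tenants))
            for task, tenants in task_tenants.items()}

priorities = monitoring_priority(TASK_TENANTS)
# Shared tasks (T3, T6) end up with a higher monitoring probability
# than single-tenant tasks (T1, T2, T4, T5).
```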

4. Service Monitoring Approach Design

As described in the motivation example, services in a distributed multi-tenant SBS have to be monitored in a decentralised manner. Multi-agent techniques will be used to develop such a decentralised approach. In this section, we first describe the formal model of the service monitoring problem. Then, the decentralised approach is presented in detail. Finally, the corresponding theoretical analysis is provided.

4.1. Model Description

A distributed multi-tenant SBS is modeled as a distributed multi-agent system. In a multi-agent based SBS, each task is monitored by an autonomous agent, which is called a task agent. A task is also known as a service class, which includes a set of candidate services. For example, in Fig. 1, there are six tasks, i.e., six service classes, in the SBS: T1: Airline Ticket Search, ..., T6: Insurance Quote. Fig. 1 can then be redrawn as a multi-agent system in Fig. 2. Comparing Fig. 2 to Fig. 1, each task in Fig. 1 is replaced by an agent in Fig. 2, where Agent i monitors task Ti, i = 1, ..., 6. Moreover, in Fig. 2, the six agents form three partially overlapping coalitions, where each coalition is formed to serve one tenant. These coalitions dissolve once the corresponding tenants leave the SBS.

[Figure 2. An Agent-based SBS: six task agents (Agent 1, ..., Agent 6) forming three partially overlapping coalitions (Coalition 1, Coalition 2, Coalition 3).]

Formally, service monitoring between a task agent and a task is modeled as a two-player, two-action supervisory game. In a general supervisory game [52], there are a supervisor and a worker. Every day, both the supervisor and the worker have two choices. The supervisor can choose to inspect the worker or not, while the worker can choose to work or laze (Table 1).

Table 1. Payoff matrix of a supervisor and a worker

                      laze                work
  inspect             r - c, -(p - b)     -c, 0
  not inspect         -r, b               0, 0

If the worker is lazing and is inspected by the supervisor, the worker gets a negative reward, -(p - b), and the supervisor gets a positive reward, r - c. Here, p is the penalty that the supervisor imposes on the worker, and b is the benefit obtained by the worker for lazing. Usually, p > b, as otherwise the worker will always choose to laze. r is the reward obtained by the supervisor for inspecting a lazy worker, and c is the cost incurred by the supervisor for the inspection. Usually, r > c, as otherwise the supervisor will never inspect the worker. If the supervisor inspects a working worker, the supervisor wastes the inspection cost c, while the worker gets reward 0. If the worker lazes and is not inspected, the worker gets a benefit, b, while the supervisor loses reward r. Finally, if the worker works and the supervisor does not inspect, both of them get reward 0. Usually, c > 0, as otherwise the supervisor would inspect the worker every day, and b > 0, as otherwise the worker would work every day. Therefore, in this situation, the worker wants to set a probability for lazing with which it can most likely avoid the supervisor's inspection, while the supervisor wants to set a probability for inspection with which it can most likely catch a lazy worker.

In this paper, a task agent is represented as a supervisor, and a task is represented as a worker. A task agent has two actions: monitor and not monitor. A task has two actions: fail and not fail. Similar to Table 1, the reward matrix of a task agent and a task is given in Table 2, where x11 = r - c, y11 = -(p - b); x12 = -c, y12 = 0; x21 = -r, y21 = b; x22 = 0, y22 = 0. We set r > c > 0 and p > b > 0. In this service monitoring problem, the cost, c, is the amount of resource which a task agent uses to monitor the corresponding task. The reward, r, is the avoided loss of the SBS. Penalty p and benefit b are meaningless in this problem; they are set only to make the model complete.

The service monitoring problem in this paper is how a task agent sets a monitoring probability with which a failed task can most likely be detected. It should be noted that in this problem, a task cannot set a failing probability, because tasks are unconscious. The failing probability of a task is fixed, which


however is unknown to the monitoring task agent. Although tasks are unconscious, the service monitoring problem can still be formalised using the game-theoretical model. This can be explained by the fact that the game-theoretical model is flexible, and unconscious players, who cannot set action probability distributions, are allowed in the model.

Table 2. Payoff matrix of a task agent and a task

                      fail             not fail
  monitor             x11, y11         x12, y12
  not monitor         x21, y21         x22, y22

As the SBS is a multi-tenant SBS, based on how services are shared, there are four multi-tenancy maturity levels [44, 12].

Level 1: At the first level, an independent instance of the SBS is composed and optimised for each tenant. For the example given in Fig. 1, at maturity level 1, three independent systems are composed, each of which is customised for one travel agency. The travel agencies do not share an execution engine or any component services, as shown in Fig. 3.

[Figure 3. Maturity level 1 example: the three travel agencies use three independent system instances, each with its own execution engine and component services.]

Level 2: At the second level, one system instance is composed that enacts three execution plans, each of which is customised for one travel agency. The travel agencies share an execution engine but not the component services, as displayed in Fig. 4.

[Figure 4. Maturity level 2 example: the tenants share one execution engine but no component services.]

Level 3: At the third level, all tenants share an execution engine and some tenants share some sets of component services, as shown in Fig. 5.

[Figure 5. Maturity level 3 example: the tenants share one execution engine and some component services.]

Level 4: At the fourth level, all tenants' workloads are balanced by a tenant workload balancer. The three travel agencies share all the execution engines and certain component services of each system instance, as depicted in Fig. 6.

[Figure 6. Maturity level 4 example: a tenant workload balancer distributes the tenants' workloads over execution engines which share certain component services.]

To make the proposed approach applicable at different maturity levels, agent cloning technology is used, where agents generate new agents to take over part of their workload once they are overloaded [53, 54]. For example, comparing Fig. 4 to Fig. 5, it can be seen that in Fig. 4, two services from task T3 are used, i.e., S3,3 and S3,6, and two services from task T6 are used, i.e., S6,2 and S6,5, whereas in Fig. 5, only one service from each task is used. Therefore, agent cloning technology will be used when the proposed approach is applied at maturity level 2. Then, Fig. 2 will be redrawn as Fig. 7, where both Agent 3 and Agent 6 generate a new agent and allocate part of the monitoring work to the new agent, e.g., Agent 3 monitoring S3,6 and Agent 3' monitoring S3,3.
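The agent-cloning idea can be sketched as follows. The TaskAgent class, its capacity parameter and the hand-over rule are illustrative assumptions of this sketch, not the actual mechanism of [53, 54].

```python
# Illustrative sketch (not the paper's implementation): an overloaded task
# agent spawns a clone and delegates the excess monitoring work to it, as
# Agent 3 does with S3,3 and S3,6 at maturity level 2.

class TaskAgent:
    def __init__(self, name, capacity=1):
        self.name = name
        self.capacity = capacity  # max services this agent monitors itself
        self.services = []        # services monitored by this agent
        self.clones = []          # clones spawned to absorb overload

    def assign(self, service):
        if len(self.services) < self.capacity:
            self.services.append(service)
        else:
            # Overloaded: generate a clone and delegate the service to it.
            clone = TaskAgent(self.name + "'", self.capacity)
            clone.assign(service)
            self.clones.append(clone)

agent3 = TaskAgent("Agent 3")
agent3.assign("S3,6")  # monitored by Agent 3 itself
agent3.assign("S3,3")  # exceeds capacity, so delegated to clone Agent 3'
```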


[Figure 7. An Example of Agent Cloning: Agent 3 and Agent 6 each generate a clone (Agent 3' and Agent 6') to share their monitoring work.]

4.2. Detail of the Approach

Algorithm 1: Service monitoring of a task agent
1:  Let k = 1 be the time slot, γ be the discount factor, and ξ be the learning rate;
2:  For each action a, initialise the value function Q(a) to 0 and initialise the probability for selecting the action, π(a), to 1/n, where n is the number of available actions;
3:  repeat
4:      select an action, a, from the available actions based on the probability distribution π over these actions;
5:      take the selected action, observe the payoff, pay, and update the Q-value of the selected action: Q(a) ← (1 − ξ)Q(a) + ξ(pay + γ max_{a'} Q(a'));
6:      approximate the failure probability of the monitored task based on the Q-value;
7:      evaluate the importance of the task;
8:      for each action a, update the probability for selecting the action, π(a), based on the approximation and evaluation;
9:      π ← Normalise(π);
10:     ξ ← (k/(k+1)) · ξ;
11:     k ← k + 1;
12: until the monitored task fails and is detected;
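As a rough illustration, the loop of Algorithm 1 can be sketched in Python as below. The payoff constants and the simplified probability-update rule standing in for Lines 6-8 (which the paper derives in Subsections 4.3-4.5) are assumptions of this sketch, not the authors' implementation.

```python
import random

# Illustrative payoffs for Table 2, assuming r > c > 0.
r, c = 1.0, 0.2
PAYOFF = {                          # task-agent payoffs x_ij
    ("monitor", "fail"): r - c,    ("monitor", "ok"): -c,
    ("not monitor", "fail"): -r,   ("not monitor", "ok"): 0.0,
}

def monitor_task(fail_prob=0.05, gamma=0.9, xi=0.5, max_slots=100000, seed=0):
    rng = random.Random(seed)
    actions = ["monitor", "not monitor"]
    Q = {a: 0.0 for a in actions}                  # Line 2: Q(a) = 0
    pi = {a: 1.0 / len(actions) for a in actions}  # Line 2: pi(a) = 1/n
    k = 1
    for _ in range(max_slots):                     # Lines 3-12 (capped here)
        a = rng.choices(actions, weights=[pi[x] for x in actions])[0]  # Line 4
        outcome = "fail" if rng.random() < fail_prob else "ok"
        pay = PAYOFF[(a, outcome)]                 # Line 5: observe payoff
        Q[a] = (1 - xi) * Q[a] + xi * (pay + gamma * max(Q.values()))
        # Lines 6-8, simplified: nudge pi toward the action with higher Q.
        best = max(actions, key=lambda x: Q[x])
        pi = {x: max(pi[x] + (0.01 if x == best else -0.01), 0.01)
              for x in actions}
        total = sum(pi.values())                   # Line 9: normalise pi
        pi = {x: pi[x] / total for x in actions}
        xi *= k / (k + 1)                          # Line 10: decay learning rate
        k += 1                                     # Line 11
        if a == "monitor" and outcome == "fail":   # Line 12: failure detected
            break
    return k, pi

k, pi = monitor_task()
```

With rare failures, π tends to drift toward not monitor, which mirrors the benefit-versus-cost tradeoff discussed in the introduction; the paper's importance-based update in Line 8 counteracts this for heavily shared tasks.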

Based on the above model, a reinforcement learning algorithm is proposed. Task agents use this algorithm to learn their optimal actions via trial-and-error interactions with the monitored tasks. The algorithm is a Q-learning algorithm (Algorithm 1), which is a type of reinforcement learning algorithm. The benefit of reinforcement learning is that an agent does not need a teacher to learn how to solve a problem. The only signal used by the agent to learn from its actions in dynamic environments is the payoff, a number that tells the agent whether its last action was good or not [55]. Q-learning is model-free, which enables agents to act optimally in Markovian domains with only local information [56]. During the learning process, in a particular state, an agent selects an action to carry out. This selection is based on a probability distribution over the available actions. For example, suppose there are two actions, act1 and act2, and the probability distribution over the two actions is {0.3, 0.7}. Then action act1 has probability 0.3 of being selected while act2 has probability 0.7 of being selected. The agent then estimates the consequence of the selected action based on the immediate reward which it receives by carrying out that action.

Algorithm 1 depicts the operation of the proposed approach. In the first time slot, the agent initialises the learning rate ξ, the discount factor γ, the Q-value of each action, Q(a), and the probability for selecting each action, π(a) (Lines 1 and 2). Then, in each following time slot, the algorithm starts from Line 3. In Line 4, the agent selects an action, monitor or not monitor, based on the probability distribution over the two actions. In Line 5, the agent observes the payoff and updates the Q-value of the selected action based on the payoff. Afterwards, in Lines 6 and 7, the agent uses the updated Q-value to approximate the failure probability of the monitored task (Subsection 4.3) and then evaluates the importance of the monitored task (Subsection 4.4). Thereafter, in Line 8, the agent uses the approximation and evaluation to update the action selection probability (Subsection 4.5). In Line 9, the agent normalises the probability distribution π to ensure that each π(ai) ∈ [0, 1] and Σ_{1≤i≤n} π(ai) = 1 (Subsection 4.6). In Line 10, to ensure the convergence of the algorithm (Subsection 4.7), the learning rate ξ is decayed. Finally,


in Line 11, the time slot k is increased by 1 to mark the progress of time.

In Algorithm 1, there are four problems which have to be addressed:

1. How does an agent approximate the failure probability of the monitored task (Line 6 of Algorithm 1)?
2. How does an agent evaluate the importance of the monitored task (Line 7 of Algorithm 1)?
3. How does an agent update the action selection probability based on the evaluation and approximation (Line 8 of Algorithm 1)?
4. How does an agent normalise an invalid probability distribution (Line 9 of Algorithm 1)?

The solutions to the four problems are provided in the following sub-sections, respectively.

4.3. How does an agent approximate the failure probability of the monitored task?

According to Table 2, based on statistical knowledge, the expected payoffs of a task agent for taking the actions monitor and not monitor are

P_1 = \lim_{\alpha_1 \to 1} \sum_{1 \le i \le 2} \sum_{1 \le j \le 2} x_{ij} \alpha_i \beta_j = \sum_{1 \le j \le 2} x_{1j} \beta_j    (1)

and

P_2 = \lim_{\alpha_2 \to 1} \sum_{1 \le i \le 2} \sum_{1 \le j \le 2} x_{ij} \alpha_i \beta_j = \sum_{1 \le j \le 2} x_{2j} \beta_j,    (2)

respectively. In Equations 1 and 2, P1 is the expected payoff of a task agent for taking the action monitor and P2 is the expected payoff of a task agent for taking the action not monitor. Moreover, α1

D. Ye, Q. He, Y. Wang and Y. Yun / Journal of Parallel and Distributed Computing 00 (2018) 1–15

and α2 are the probabilities of the task agent for taking actions monitor and not monitor, respectively, while β1 and β2 are the probabilities of the monitored task for failure and not failure. Based on mathematical knowledge, it can be proved that if an action is carried out in an infinite number of times and the learning rate ξ is decreased properly and gradually, the Q-value of the action will finally converge to Q∗ with probability 1 [57]. The Q-value can be obtained by using the equation which is given in Line 5 of Algorithm 1) and Q∗ is the expected payoff of an agent by carrying out that action. The proof will be given in Subsection 4.7. According to Equations 1 and 2, it can be found that the task agent’s expected payoff for taking each action depends on the probabilities of the monitored task for failure and not failure. Based on this finding, to approximate an action’s expected payoff, an agent can use that action’s current Q-value (i.e., Equations 3 and 4). X Q(1) = P1 = x1 j β j (3)

and

Q(2) = P_2 = Σ_{1≤j≤2} x_{2j} β_j.    (4)

As Q(1) and Q(2) are known to the task agent, β_1 and β_2 can be calculated using Equations 3 and 4. After the calculation, the task agent adjusts the probability of selecting each action. This adjustment is described in the next sub-section. It is infeasible to actually carry out an action an infinite number of times. Therefore, the expected payoff cannot be computed precisely but can only be approximated. However, as the learning process continues, the optimal Q-value Q* is gradually approached by the learning algorithm, so the approximation precision gradually improves and the task failure probability can be computed by the task agent more and more precisely.

4.4. How does an agent evaluate the importance of the monitored task?
As described in Section 3, the environment is a multi-tenant environment, which implies that a task may be shared by multiple tenants. Moreover, even if two tasks are shared by the same number of tenants, each tenant may provide services to different numbers of end users, which means that the two tasks may be used by different numbers of end users and thus have different importance. Therefore, in a multi-tenant environment, the importance of the tasks is not identical, which should be considered by task agents when updating the action selection probability. Generally, a task is more important than another task if it is shared by more tenants. Also, when two tasks are shared by the same number of tenants, the task which is used by more end users is the more important one. Formally, the importance evaluation model is shown as Equation 5:

I_i = ((u_i − u_min) / (u_max − u_min)) · (max − min) + min,    (5)

where I_i is the importance of task i; u_i is the current number of users of task i; u_min is the minimum number of users of the SBS; u_max is the maximum number of users of the SBS; max is the upper bound of importance; and min is the lower bound of importance. Hence, the higher the value of I_i, the more important task i is.

4.5. How does an agent update the action selection probability?
Using a gradient ascent algorithm [58], an agent's expected payoff can be increased by moving its strategy, with a step size, in the direction of the current gradient. The gradient is calculated as the partial derivative of the agent's expected payoff with respect to its probability of selecting each action. As the task agent's expected payoff is P = Σ_{1≤i≤2} Σ_{1≤j≤2} x_{ij} α_i β_j and α_1 + α_2 = 1, we have

∂P/∂α_1 = Σ_{1≤j≤2} x_{1j} β_j    (6)

and

∂P/∂α_2 = Σ_{1≤j≤2} x_{2j} β_j.    (7)

By comparing Equations 3 and 6, and Equations 4 and 7, respectively, it can be seen that the partial derivatives of the agent's expected payoff can be approximated using the corresponding Q-values. Suppose that in the kth time slot, α_1^(k) and α_2^(k) are the task agent's probabilities of selecting actions monitor and not monitor, respectively, and the importance of the task is I_i. Then the probabilities for the next time slot are updated to

α_1^(k+1) = I_i · (α_1^(k) + η · ∂P/∂α_1^(k))    (8)

and

α_2^(k+1) = (max − I_i) · (α_2^(k) + η · ∂P/∂α_2^(k)),    (9)

where η is the gradient step size. Based on mathematical knowledge [59], it is known that the procedure converges if η is small enough.
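The steps of Subsections 4.3–4.5 can be sketched in Python as follows. This is an illustrative reading of Equations 3, 5, 8 and 9, not the authors' implementation: the payoff matrix X, the Q-values and the user counts below are hypothetical values.

```python
# Illustrative sketch of Subsections 4.3-4.5; X and all inputs are
# hypothetical. Row i of X is the agent's action (0 = monitor,
# 1 = not monitor); column j is the task's state (0 = failure,
# 1 = no failure).

X = [[5.0, -1.0], [-6.0, 2.0]]   # hypothetical payoff matrix x_ij

def failure_probability(q_monitor, x=X):
    """Solve Eq. 3 for beta_1, using beta_2 = 1 - beta_1:
    Q(1) = x11*b1 + x12*(1 - b1)."""
    x11, x12 = x[0]
    return (q_monitor - x12) / (x11 - x12)

def importance(u_i, u_min, u_max, lo=0.0, hi=1.0):
    """Eq. 5: scale task i's user count linearly into [lo, hi]."""
    return (u_i - u_min) / (u_max - u_min) * (hi - lo) + lo

def update_selection_probs(alpha, q, imp, eta=1e-4, hi=1.0):
    """Eqs. 8-9: gradient ascent, with the Q-values standing in for the
    partial derivatives of the expected payoff (Eqs. 6-7)."""
    return [imp * (alpha[0] + eta * q[0]),
            (hi - imp) * (alpha[1] + eta * q[1])]

beta1 = failure_probability(q_monitor=2.0)        # approximated P(failure)
imp = importance(u_i=150, u_min=100, u_max=300)
alpha = update_selection_probs([0.5, 0.5], q=[2.0, 1.0], imp=imp)
# alpha generally no longer sums to 1 here; it is repaired by the
# Normalise() routine of Subsection 4.6.
```

Note that the updated pair alpha is in general an invalid distribution, which is exactly why the normalisation step of the next sub-section is needed.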

4.6. How does an agent normalise an invalid probability distribution?
Proportion-based mapping is employed to conduct the normalisation. Because the invalid probability distribution contains the learned knowledge, the normalised probability distribution should be as 'close' as possible to the invalid one so as to preserve the learned knowledge. Using proportion-based mapping, each invalid probability can be adjusted into the range (0, 1), and the new probability is of the same order of magnitude as the old one. The function Normalise() is given in pseudocode form in Algorithm 2. In Line 2, π(a_k) is the probability of selecting action a_k. From Line 3 to Line 6, any probability smaller than the mapping lower bound L is adjusted to L, and the other probabilities are scaled proportionally towards the mapping centre. Then, from Line 7 to Line 9, all the probabilities are summed up and each probability is further adjusted based on its proportion of the sum.
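A Python sketch of the proportion-based mapping described above follows; it is an illustrative reading of the Normalise() pseudocode (Algorithm 2), using the paper's mapping centre 0.5 and lower bound 0.001.

```python
# Illustrative sketch of Algorithm 2 (Normalise()); not the authors' code.

def normalise(pi, c0=0.5, lower=0.001):
    """Proportion-based mapping: pull the smallest probability up to
    `lower` by shrinking every value towards the centre c0 with the same
    factor rho, then rescale so the distribution sums to 1."""
    d = min(pi)
    if d < lower:
        rho = (c0 - lower) / (c0 - d)             # shrink factor, maps d -> lower
        pi = [c0 - rho * (c0 - p) for p in pi]
    total = sum(pi)
    return [p / total for p in pi]
```

For example, an invalid distribution such as [-0.2, 1.3] is first pulled into (0, 1) and then rescaled to sum to 1, while a distribution that is already valid (e.g., [0.3, 0.7]) is left unchanged.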


Algorithm 2: Normalise()
1: Suppose that there are m actions, i.e., a_1, a_2, ..., a_m;
2: Let d = min_{1≤k≤m} π(a_k), mapping centre c_0 = 0.5, and mapping lower bound L = 0.001;
3: if d < L then
4:     ρ ← (c_0 − L) / (c_0 − d);
5:     for k = 1 to m do
6:         π(a_k) ← c_0 − ρ · (c_0 − π(a_k));
7: r ← Σ_{1≤k≤m} π(a_k);
8: for k = 1 to m do
9:     π(a_k) ← π(a_k) / r;
10: return π;

4.7. Theoretical analysis
In this sub-section, the proof of the convergence of Algorithm 1 is given. In Algorithm 1, define ξ_k as the learning rate in the kth repeat (i.e., in the kth time slot, as Algorithm 1 is executed in each time slot). Because ξ_{k+1} = (k/(k+1)) · ξ_k = (k/(k+1)) · ((k−1)/k) · ξ_{k−1} = ..., it can be deduced that ξ_k = (1/k) · ξ_1. Also, define k_i as the index of the ith time that action a is tried. For example, if the 1st time that action a is tried is in the 5th repeat of Algorithm 1, then k_1 = 5. To prove the convergence of Algorithm 1, Theorems 1 and 2 are required.

Theorem 1. A series Σ_{1≤k} a_k converges if and only if, for any given small positive number ε > 0, there always exists an integer N such that when n > N, any positive integer m makes the inequality |a_{n+1} + a_{n+2} + ... + a_{n+m}| < ε hold.

Proof. The proof of this theorem can be found in any calculus textbook, such as [60].

It should be noted that in Theorem 1, if the convergence condition is not met, i.e., |a_{n+1} + a_{n+2} + ... + a_{n+m}| ≥ ε, the series Σ_{1≤k} a_k does not converge.

Theorem 2. Given bounded payoffs and learning rate 0 ≤ ξ_{k_i} ≤ 1, if for any action a the series Σ_{1≤i} ξ_{k_i}² is bounded (i.e., Σ_{1≤i} ξ_{k_i}² < +∞) and the series Σ_{1≤i} ξ_{k_i} is unbounded (i.e., Σ_{1≤i} ξ_{k_i} = +∞), then, as i → +∞, Q(s, a) converges, with probability 1, to the optimal Q-value, i.e., the expected payoff.

Proof. The proof of this theorem can be found in [56].

Theorem 3. In Algorithm 1, for any action a, as i → +∞, Q(a) converges, with probability 1, to the expected payoff of using action a.

Proof. According to Theorem 2, in order to prove Theorem 3, we only need to prove that in Algorithm 1, for any action a, Σ_{1≤i} ξ_{k_i}² < +∞ and Σ_{1≤i} ξ_{k_i} = +∞ always hold.

We first prove that Σ_{1≤i} ξ_{k_i}² < +∞ always holds. It is known that Σ_{1≤k} 1/k² = π²/6, where k is a positive integer and tends to +∞. Then it can be deduced that Σ_{1≤i} ξ_{k_i}² = Σ_{1≤i} ξ_1²/k_i² ≤ Σ_{1≤k} ξ_1²/k² = ξ_1² · π²/6 < +∞.

Then, we prove that Σ_{1≤i} ξ_{k_i} = +∞ always holds. For a given small positive number ε and a given positive integer N, let 0 < ε < L · ξ_1/2, where L = 0.001 as defined in Algorithm 2. Because π(a) ≥ L > 0, there is at least one integer m > N such that in the (m+1)th, (m+2)th, ..., (m+m)th repeats of Algorithm 1, the number of times that action a is tried is larger than or equal to L · m. Since each of these tries occurs no later than the 2mth repeat, its learning rate is at least ξ_{2m} = ξ_1/(2m), so the sum of the corresponding learning rates is at least L · m · ξ_1/(2m) = L · ξ_1/2 > ε. According to Theorem 1, because this partial sum is not smaller than ε, namely the convergence condition in Theorem 1 is not met, the series Σ_{1≤i} ξ_{k_i} does not converge. Also, because each ξ_{k_i} is positive, Σ_{1≤i} ξ_{k_i} = +∞.

Finally, since we have proved that Σ_{1≤i} ξ_{k_i}² < +∞ and Σ_{1≤i} ξ_{k_i} = +∞, according to Theorem 2, in Algorithm 1, as i → +∞, Q(a) converges, with probability 1, to the expected payoff of using action a.

5. Experiment and Analysis
In this section, the proposed agent-based decentralised service monitoring approach, named Decentralised, is evaluated in comparison with two related and representative approaches, which are described as follows.
1) The centralised approach (named Centralised): As stated in Section 1, most of the existing approaches are centralised, so centralised approaches are representative in service monitoring. In this experiment, the centralised approach has an omniscient central controller with global knowledge of the SBS. At the beginning of each time slot, the central controller uses the integer programming approach [61] to calculate a monitoring strategy for the system based on the failure probability and importance of each task.
2) The random approach (named Random): The random approach, created by us, is a simplified version of the proposed decentralised approach. At the beginning of each time slot, each agent randomly decides whether to monitor its task in that time slot. The random approach is used to demonstrate the effectiveness and advancement of the proposed decentralised approach.
In this experiment, the performance of the three approaches, i.e., Decentralised, Centralised and Random, is evaluated by measuring four quantitative metrics: response time, computation overhead, monitoring overhead and user unhappiness.
1. Response time is the average number of time slots between the time a task fails and the time the failure is detected. Each time slot is set to 2 seconds.
2. Computation overhead is the average time, in milliseconds, used by each approach to compute a monitoring strategy.
3. Monitoring overhead is the average number of times each agent monitors its target during the experimental period.
4. User unhappiness increases by 1 for each time slot during which a user cannot access an agreed service.
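The four metrics can be made concrete with a small sketch; the trace format, field names and sample values below are hypothetical, not taken from the paper's implementation.

```python
# Hypothetical computation of the four evaluation metrics from a
# simulation trace; field names and sample records are illustrative.
from dataclasses import dataclass

@dataclass
class TaskRecord:
    fail_slot: int        # time slot in which the task failed
    detect_slot: int      # time slot in which the failure was detected
    affected_users: int   # end users who lost access to the task
    strategy_ms: float    # time to compute the monitoring strategy (ms)
    monitor_count: int    # how often the agent probed its target

def evaluate(records):
    n = len(records)
    response_time = sum(r.detect_slot - r.fail_slot for r in records) / n
    computation_overhead = sum(r.strategy_ms for r in records) / n
    monitoring_overhead = sum(r.monitor_count for r in records) / n
    # unhappiness: +1 per affected user per slot of unavailability
    unhappiness = sum(r.affected_users * (r.detect_slot - r.fail_slot)
                      for r in records)
    return response_time, computation_overhead, monitoring_overhead, unhappiness

rec = [TaskRecord(10, 12, 100, 5.0, 40), TaskRecord(20, 21, 200, 7.0, 60)]
rt, co, mo, uh = evaluate(rec)   # rt = 1.5 slots, uh = 400
```

The sketch makes explicit that user unhappiness couples the number of affected users with the length of the outage, which is why a shorter response time directly reduces it.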


5.1. Experimental setup
The experiments were conducted in a dynamic SBS in four scenarios. In the first scenario, it is measured how the performance of the three approaches changes with the increase of the initial number of tenants. In the second scenario, it is measured how the performance changes with the variation of task failure probability. In the third scenario, it is measured how the performance changes with the variation of tenant joining/leaving probability. In the fourth scenario, it is measured how the performance changes across different multi-tenancy maturity levels. The values and meanings of the parameters used in the experiments are listed in Table 3. Generally, the values of the reward parameters r, c, p, b depend on real SBSs, because in different SBSs the cost c, i.e., the resource consumed for monitoring, is different. In this paper, for simplicity, we set the cost to 1. The values of the three other parameters r, p, b are hand-tuned to yield quick convergence. In addition, the values of the learning parameters ξ, γ, η are also hand-tuned to achieve the best results. The setup of the time slot length also depends on real SBSs. Usually, an important SBS needs a short time slot length. In this paper, for simplicity, we set the time slot length to 2 seconds. Moreover, the experiment was coded in Java on an Intel i5-2450M 2.5GHz Windows 10 PC with 10GB RAM.

Table 3. Parameter settings

Parameters    Values        Explanations
r, c, p, b    5, 1, 6, 2    reward parameters
ξ             0.8           learning rate
γ             0.65          discount factor
η             0.0001        gradient step size
max           1             upper bound of importance
min           0             lower bound of importance
—             2 seconds     the length of a time slot
—             100 ∼ 300     the number of users of each tenant

5.2. Experimental results
5.2.1. The first scenario
Fig. 8 demonstrates the performance of the three approaches with different initial numbers of tenants. In this scenario, the average task failure probability is set to 0.3 with a variance of 0.05. The tenant joining or leaving probability is set to 0.2. The maturity level is set to level 3 (see footnote 1). In Fig. 8(a), the response time under the Random and Decentralised approaches keeps relatively steady irrespective of the initial number of tenants, whereas the response time under the Centralised approach rises with the increase of the initial number of tenants. In the Random and Decentralised approaches, each task is monitored by an agent which makes decisions autonomously. Thus, the number of tenants does not

Figure 8. Performance of the three approaches with different initial numbers of tenants: (a) response time; (b) computation overhead; (c) monitoring overhead; (d) user unhappiness.

1 In level 3, tenants share both the execution engine and certain component services, whereas in levels 1 and 2, tenants do not share any component services. Thus, strictly, levels 1 and 2 are not multi-tenancy. In level 4, for each system instance, there is a set of execution engines and certain component services which are shared by tenants. This would increase the workload of the central manager in the Centralised approach. We, therefore, set the system maturity level to level 3 in the experiment.

affect the response time under the Random and Decentralised approaches. In the Centralised approach, the central controller obtains a monitoring strategy by considering the failure probabilities of all the tasks. When the number of tenants is large, the number of tasks becomes large as well. In this situation, to obtain a monitoring strategy in a timely manner, the central controller has to sacrifice the quality of the monitoring strategy. Thus, the response time under the Centralised approach becomes longer as the number of initial tenants increases. In Fig. 8(b), the computation overhead of the Random and Decentralised approaches keeps almost steady with the increase of the initial number of tenants, while the computation overhead of the Centralised approach increases with the increase of the initial number of tenants. In the Random and Decentralised approaches, each agent monitors one task only and each agent obtains a monitoring strategy locally and individually. Therefore, the increase of the number of tenants does not affect the computation overhead of an agent. In the Centralised approach, a monitoring strategy is obtained based on all the tasks, so the increase of the number of tenants will augment the computation overhead of the central controller. In Fig. 8(c), the overall monitoring overhead of the three approaches rises with the increase of the initial number of tenants. The number of tasks in the SBS ascends with the increase of the number of tenants. Also, in the multi-tenant environment, each task, e.g., service class, may provide multiple services so as to meet different tenants’ requirements, and each task is monitored by one agent. Thus, the overall monitoring overhead unavoidably grows. In Fig. 8(d), the user unhappiness under the three approaches steadily rises with the increase of the initial number of tenants. This is due to the fact that with the increase of the number of tenants, the number of end users also increases. 
Since user unhappiness is accumulated over time, the increase in end users leads to an increase in user unhappiness. It can be seen that the proposed Decentralised approach responds as quickly as the Centralised approach (Fig. 8(a)) and uses as little computation as the Random approach (Fig. 8(b)). The basic working processes of the Decentralised and Random approaches are the same: both calculate monitoring strategies based only on local information. Hence, the two approaches have nearly the same computation overhead. As the Decentralised approach uses a multi-agent learning technique that makes the agents smarter and smarter, it can reduce the response time to a level as low as that of the Centralised approach.


5.2.2. The second scenario
Fig. 9 displays how the performance of the three approaches changes with the variation of task failure probability. In this scenario, the initial number of tenants is set to 50. The tenant joining or leaving probability is set to 0.2. The maturity level is level 3. In Fig. 9(a), with the increase of task failure probability, the response time under all of the three approaches decreases

Figure 9. Performance of the three approaches with different task failure probabilities: (a) response time; (b) computation overhead; (c) monitoring overhead; (d) user unhappiness.


gradually, and the rate of decrease under the Centralised and Decentralised approaches is steeper than that under the Random approach. In the Random approach, the monitoring probability in each time slot is almost fixed at 0.5. Thus, when the failure probability of each task increases, the failure can be detected in a shorter time. In the Centralised and Decentralised approaches, the monitoring strategies are created based on the failure probability of each task. When the failure probability increases, the monitoring probability increases correspondingly and eventually exceeds 0.5. Therefore, the response time under the Centralised and Decentralised approaches decreases as well. Moreover, as the monitoring probability in the Centralised and Decentralised approaches becomes larger than that in the Random approach, the rate of response time decrease under the Centralised and Decentralised approaches is steeper than that under the Random approach.

In Fig. 9(b), the increase of task failure probability does not influence the computation overhead of the Random and Centralised approaches but incurs extra computation overhead in the Decentralised approach. In the SBS, when a task fails, one or several backup services will be used to replace the failed services. In the Decentralised approach, as the backup services are new to the monitoring agents, the agents need to learn and approximate the failure probability of the backup services. Such learning and approximation cost extra computation overhead. In the Random and Centralised approaches, there is no such learning and approximation process, so the computation overhead is not affected.

In Fig. 9(c), with the increase of task failure probability, the Centralised and Decentralised approaches incur more and more monitoring overhead, while the monitoring overhead of the Random approach is relatively steady. As the monitoring strategies created by the Centralised and Decentralised approaches are based on task failure probabilities, the increase of task failure probability incurs more frequent monitoring, i.e., more monitoring overhead. In the Random approach, the monitoring strategy is built randomly, so its monitoring overhead stays almost stable.

In Fig. 9(d), with the increase of task failure probability, the user unhappiness under the three approaches grows, but the increase under the Random approach is sharper than that under the Centralised and Decentralised approaches. With the increase of task failure probability, the number of users who cannot access agreed services increases. As the Random approach does not use any intelligent technique, the user unhappiness under it unavoidably increases. In contrast, both the Centralised and Decentralised approaches take the failure probability and the importance of tenants into account, so the response time can be decreased (Fig. 9(a)). Because the calculation of user unhappiness is based on the number of affected users and the length of the affected time, the decrease of the response time can offset the increase of the number of affected users to some extent.

5.2.3. The third scenario
Fig. 10 shows how the performance of the three approaches changes with the variation of tenant joining/leaving probability. In this scenario, the initial number of tenants is set to 50.

Figure 10. Performance of the three approaches with different tenant joining/leaving probabilities: (a) response time; (b) computation overhead; (c) monitoring overhead; (d) user unhappiness.

The average task failure probability is set to 0.3 with a variance of 0.05. The maturity level is level 3. In Figs. 10(a) and 10(b), with the increase of tenant joining/leaving probability, the response time and computation overhead of the Random and Centralised approaches remain almost unchanged, whereas the Decentralised approach incurs more and more response time and computation overhead. When new tenants join the SBS, new services have to be provided to them. As the failure probabilities of the new services are unknown to the monitoring agents in the Decentralised approach, the agents need extra time and computation to learn the failure probabilities. In Fig. 10(c), with the increase of tenant joining/leaving probability, the monitoring overhead of the three approaches keeps relatively stable. When new tenants join the SBS, the monitoring overhead of the three approaches increases, because the introduction of more tenants incurs extra monitoring overhead. However, when existing tenants leave the SBS, the monitoring overhead of the three approaches decreases, and this reduction can offset the aforementioned increase to some extent. In Fig. 10(d), with the increase of tenant joining/leaving probability, the user unhappiness under all the three approaches decreases. This can be explained by the fact that when a task fails, the affected tenant may leave the system before the failure is detected. With the increase of tenant leaving probability, this situation is more likely to happen. As user unhappiness is based not only on the number of affected users but also on the length of the affected time, tenants leaving early can reduce user unhappiness.


5.2.4. The fourth scenario
Fig. 11 shows how the performance of the three approaches changes with the variation of the multi-tenancy maturity level. In this scenario, the initial number of tenants is set to 50. The average task failure probability is set to 0.3 with a variance of 0.05. The tenant joining or leaving probability is set to 0.2. In Figs. 11(a) and 11(b), the response time and computation overhead of the Random and Decentralised approaches remain stable across maturity levels, while those of the Centralised approach fluctuate. As described above, multi-tenancy levels are based on the extent of sharing, and different extents of sharing incur different numbers of execution engines and component services. As each monitoring agent in the Random and Decentralised approaches autonomously monitors its task and makes its own monitoring strategies, the different numbers of execution engines and component services do not affect their response time and computation overhead. For the Centralised approach, in maturity level 1, each tenant has a separate execution engine which creates monitoring strategies for that tenant, whereas in maturity level 2, all the tenants share one execution engine. The execution engine in maturity level 2 has to create monitoring strategies for all the tenants, while in maturity level 1, each execution engine creates monitoring strategies for one tenant only. Thus, the computation overhead of the Centralised approach in maturity level 2 is larger than that in maturity level 1. However, the response time under the

Figure 11. Performance of the three approaches with different multi-tenancy maturity levels: (a) response time; (b) computation overhead; (c) monitoring overhead; (d) user unhappiness.

Centralised approach in maturity level 2 is less than that in maturity level 1. In maturity level 2, the central controller, i.e., the central execution engine, has global knowledge of the system, but in maturity level 1, each execution engine has information about one tenant only. Therefore, the monitoring strategies formulated by the central controller in maturity level 2 are better than those created by each execution engine in maturity level 1, and the better monitoring strategies lead to shorter response time. For maturity levels 2 and 3, the only difference is that some component services are shared by some tenants in maturity level 3, while no component services are shared in maturity level 2. As some component services are shared in maturity level 3, the total number of used services in maturity level 3 is less than that in maturity level 2. Thus, the response time and computation overhead of the Centralised approach in maturity level 3 are slightly less than those in maturity level 2. For maturity levels 3 and 4, the difference is that in maturity level 4, there are multiple system instances for tenants and each system instance can meet the tenants' requirements. Compared to maturity level 3, although the total number of component services in maturity level 4 is much larger, each system instance has a separate execution engine to generate monitoring strategies. Hence, for the Centralised approach, the computation overhead in maturity level 3 is almost the same as that in maturity level 4, and so is the response time.

In Fig. 11(c), it can be seen that in maturity level 3, all the three approaches incur the least monitoring overhead, whereas in maturity level 4, all the three approaches incur the most monitoring overhead. This is mainly because, among the four maturity levels, the number of component services is the smallest in maturity level 3 and the largest in maturity level 4. As the Random and Decentralised approaches are based on each individual monitoring agent, the monitoring overhead of the two approaches in maturity level 1 is almost the same as that in maturity level 2 and is slightly higher than that in maturity level 3. For the Centralised approach, the monitoring overhead in maturity level 1 is much larger than that in maturity level 2. As analysed above, the monitoring strategies created by the Centralised approach in maturity level 2 are better than those in maturity level 1; by using better monitoring strategies, the monitoring overhead can be reduced.

In Fig. 11(d), the user unhappiness is the highest in maturity level 3, the lowest in maturity level 4, and almost the same in maturity levels 1 and 2. This is because in maturity levels 1 and 2, tenants do not share services, so the failure of one service affects one tenant at most. However, in maturity level 3, as tenants may share services, the failure of one service may affect more than one tenant. Thus, the user unhappiness is higher in maturity level 3 than in maturity levels 1 and 2. In maturity level 4, backup services are used: when a service fails, it can be replaced by a backup service immediately. Hence, the user unhappiness is the lowest in maturity level 4.


5.3. Discussion of experiments In this sub-section, four aspects of the experiments are discussed. 5.3.1. Choice of the comparison approaches In the expriments, a centralised approach and a random approach are used for comparison. The Centralised approach is one of the most popular approachs for service monitoring [41, 34, 28] and thus is used as a baseline for comparison in our evaluation. The Centralised approach has the potential of the single point of failure and is not suitable in large scale distributed SBSs. Compared with the Centralised approach, our approach is decentralised which can avoid the single point of failure and is suitable in large scale distributed SBSs. Therefore, through comparison with the Centralised approach in the experiments, the effectiveness of our approach in finding solutions, especially in large scale distributed SBSs, can be demonstrated. The Random approach is created by us. It is decentralised as well but does not involve any intelligent techniques, where each agent just randomly makes decisions. Thus, through comparison with the Random approach in the experiments, the advancement of the intelligent techniques used in our approach can be shown. The three approaches are evaluated in various scenarios with different scales, different dynamism and different maturity levels. 5.3.2. Setup of the experimental scenarios Four scenarios are set up in the experiments: the change in number of tenants in the system, the change in service failure probability, the change tenant joining/leaving probability, and the change in maturity level. The four scenarios are representative in multi-tenant SBSs. The change in number of tenants is used to simulate different scales of multi-tenant SBSs so as to evaluate the scalability of the three approaches. The change in service failure probability is used to simulate different service qualities so as to evaluate the robustness of the three approaches. 
The change in tenant joining/leaving probability simulates different degrees of dynamism of multi-tenant SBSs so as to evaluate the adaptability of the three approaches. The change in maturity level simulates different levels of service sharing so as to evaluate the cooperativity of the three approaches.

5.3.3. Comprehensiveness of the experiments

Four metrics are measured in the experiments: response time, computation overhead, monitoring overhead and user unhappiness. Other metrics, such as false positive rate and false negative rate, could certainly also be used. A false positive means that an agent reports the failure of a service when that service has not actually failed, while a false negative means that a service fails but the agent does not detect the failure [62]. The false positive rate is subsumed by the monitoring overhead metric: an approach with a high false positive rate implies that its agents monitor their target services very often, and such frequent monitoring inevitably incurs a large monitoring overhead. The false negative rate is subsumed by the response time metric.

D. Ye, Q. He, Y. Wang and Y. Yang / Journal of Parallel and Distributed Computing 00 (2018) 1–15

An approach with a high false negative rate means that its agents often miss the detection of service failures. Such misses consequently cause a long response time.
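For illustration, the two rates can be computed from a log of monitoring decisions. This is a hypothetical sketch (the log representation is an assumption, not part of the paper's experimental setup):

```python
def detection_rates(log):
    """Compute false positive and false negative rates from a log of
    (reported_failure, actual_failure) pairs.

    A false positive: the agent reports a failure, but the service did
    not actually fail. A false negative: the service failed, but the
    agent did not detect the failure.
    """
    fp = sum(1 for reported, actual in log if reported and not actual)
    fn = sum(1 for reported, actual in log if not reported and actual)
    reports = sum(1 for reported, _ in log if reported)    # all reported failures
    failures = sum(1 for _, actual in log if actual)       # all actual failures
    fp_rate = fp / reports if reports else 0.0
    fn_rate = fn / failures if failures else 0.0
    return fp_rate, fn_rate

# Example: 2 of 3 reports are spurious; 1 of 2 real failures is missed,
# so fp_rate = 2/3 and fn_rate = 1/2.
log = [(True, True), (True, False), (True, False), (False, True), (False, False)]
print(detection_rates(log))
```

A high fp_rate corresponds to many spurious monitoring reports (and hence a large monitoring overhead), while a high fn_rate corresponds to many missed failures (and hence a long response time), as discussed above.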


5.3.4. Results of the experiments

The results reported in the experiments are averaged over 100 runs of each approach in each scenario. Specifically, as described in Section 4.7, each agent under our approach needs a learning process to approximate the failure probability of the monitored service. Thus, in the early stages, the efficiency of our approach is similar to that of the Random approach, and its response time is longer than that of the Centralised approach. In the later stages, after the agents have learned the failure probabilities of the monitored services, the efficiency of our approach improves significantly. The results of the three approaches reported in the experiments are obtained after 200 time slots.

5.4. Summary

Overall, the experiments demonstrate that our Decentralised approach responds to task failures as quickly as the Centralised approach while incurring as little computation overhead as the Random approach. As our Decentralised approach distributes the computation and monitoring over the individual monitoring agents, its average computation overhead is lower than that of the Centralised approach. Moreover, because the monitoring agents in our Decentralised approach learn the failure probability of each monitored target, it eventually generates monitoring strategies as good as those of the Centralised approach. As the Random approach lacks global knowledge and does not use any intelligent techniques, it incurs the highest monitoring overhead in all circumstances. The results also show that the Centralised approach incurs much more computation overhead than the Random and Decentralised approaches in all situations, especially in large-scale SBSs (more than 50 tenants), because the Centralised approach relies on a single controller that carries out all the computation. Thus, the Centralised approach works efficiently only in small-scale SBSs.

6. Conclusion and Future Work

This paper proposes a novel agent-based decentralised service monitoring approach for distributed multi-tenant SBSs. It is the first decentralised monitoring approach for multi-tenant SBSs, and it can generate monitoring strategies without prior knowledge of service failure probabilities by using reinforcement learning techniques. The proposed approach thus avoids the disadvantages of existing centralised approaches while achieving almost the same performance. Its major disadvantage is that each monitoring agent needs a learning period to approximate the failure probability of the monitored target. To improve the efficiency of the proposed approach, this learning period may need to be shortened. Thus, one of our future studies is to design a more efficient learning algorithm. Moreover, the evaluation in this paper was conducted in a simulated environment. Once the learning algorithm is improved, we intend to evaluate our approach with a real-world dataset, such as the publicly available Web service dataset QWS [63].

7. Acknowledgments

This work is supported by an ARC Discovery Project (DP150101775) from the Australian Research Council, Australia.

References

[1] D. Bruneo, S. Distefano, F. Longo, M. Scarpa, Stochastic Evaluation of QoS in Service-Based Systems, IEEE Transactions on Parallel and Distributed Systems 24 (2013) 2090–2099.
[2] X. Chen, Z. Zheng, Q. Yu, M. R. Lyu, Web Service Recommendation via Exploiting Location and QoS Information, IEEE Transactions on Parallel and Distributed Systems 25 (7) (2014) 1913–1924.
[3] W. Chen, I. Paik, Toward Better Quality of Service Composition Based on a Global Social Service Network, IEEE Transactions on Parallel and Distributed Systems 26 (2015) 1466–1476.
[4] L. Baresi, S. Guinea, Self-Supervising BPEL Processes, IEEE Transactions on Software Engineering 37 (2) (2011) 247–263.
[5] R. Calinescu, L. Grunske, M. Kwiatkowska, R. Mirandola, G. Tamburrelli, Dynamic QoS Management and Optimisation in Service-Based Systems, IEEE Transactions on Software Engineering 37 (2011) 387–409.
[6] L. Baresi, S. Guinea, Event-Based Multi-level Service Monitoring, in: Proc. of IEEE 20th International Conference on Web Services (ICWS 2013), Santa Clara, CA, USA, 2013, pp. 83–90.
[7] G. Alferez, V. Pelechano, R. Mazo, C. Salinesi, D. Diaz, Dynamic Adaptation of Service Compositions with Variability Models, Journal of Systems and Software 91 (2014) 24–47.
[8] S. Wang, L. Huang, C. Hsu, F. Yang, Collaboration Reputation for Trustworthy Web Service Selection in Social Networks, Journal of Computer and System Sciences 82 (2016) 130–143.
[9] D. Ardagna, B. Pernici, Adaptive Service Composition in Flexible Processes, IEEE Transactions on Software Engineering 33 (2007) 369–384.
[10] S. C. Oh, D. Lee, S. R. Kumara, Effective Web Service Composition in Diverse and Large-scale Service Networks, IEEE Transactions on Services Computing 1 (2008) 15–32.
[11] A. Moustafa, M. Zhang, Q. Bai, Trustworthy Stigmergic Service Composition and Adaptation in Decentralized Environments, IEEE Transactions on Services Computing 9 (2) (2016) 317–329.
[12] Q. He, J. Han, F. Chen, Y. Wang, R. Vasa, Y. Yang, H. Jin, QoS-Aware Service Selection for Customisable Multi-Tenant Service-Based Systems: Maturity and Approaches, in: Proc. of the 8th IEEE International Conference on Cloud Computing (CLOUD 2015), 2015, pp. 237–244.
[13] J. Li, X. Chen, M. Li, J. Li, P. Lee, W. Lou, Secure Deduplication with Efficient and Reliable Convergent Key Management, IEEE Transactions on Parallel and Distributed Systems 25 (2014) 1615–1625.
[14] J. Li, X. Huang, J. Li, X. Chen, Y. Xiang, Securely Outsourcing Attribute-based Encryption with Checkability, IEEE Transactions on Parallel and Distributed Systems 25 (2014) 2201–2210.
[15] S. Wang, Z. Zheng, Z. Wu, M. R. Lyu, F. Yang, Reputation Measurement and Malicious Feedback Rating Prevention in Web Service Recommendation Systems, IEEE Transactions on Services Computing 8 (2015) 755–767.
[16] Y. Ma, S. Wang, P. Hung, C. Hsu, Q. Sun, F. Yang, A Highly Accurate Prediction Algorithm for Unknown Web Service QoS Values, IEEE Transactions on Services Computing 9 (2016) 511–523.
[17] P. Li, J. Li, Z. Huang, C. Gao, W. Chen, K. Chen, Privacy-preserving Outsourced Classification in Cloud Computing, Cluster Computing (2017) 1–10.
[18] J. Li, Y. Zhang, X. Chen, Y. Xiang, Secure Attribute-based Data Sharing for Resource-limited Users in Cloud Computing, Computers & Security 72 (2018) 1–12.

[19] C. Esposito, A. Castiglione, F. Pop, K. Choo, Challenges of Connecting Edge and Cloud Computing: A Security and Forensic Perspective, IEEE Cloud Computing 4 (2) (2017) 13–17.
[20] X. Chen, F. Zhang, W. Susilo, H. Tian, J. Li, K. Kim, Identity-based Chameleon Hashing and Signatures without Key Exposure, Information Sciences 265 (2014) 198–210.
[21] J. Jiang, S. Wen, S. Yu, Y. Xiang, W. Zhou, Identifying Propagation Sources in Networks: State-of-the-Art and Comparative Studies, IEEE Communications Surveys and Tutorials 19 (2017) 465–481.
[22] X. Chen, J. Li, J. Weng, J. Ma, W. Lou, Verifiable Computation over Large Database with Incremental Updates, IEEE Transactions on Computers 65 (2016) 3184–3195.
[23] Y. Wang, S. Wen, Y. Xiang, W. Zhou, Modelling the Propagation of Worms in Networks: A Survey, IEEE Communications Surveys and Tutorials (2014) 942–960.
[24] S. Ivan, S. Wen, X. Huang, H. Luan, An Overview of Fog Computing and its Security Issues, Concurrency and Computation: Practice and Experience 28 (2015) 2991–3005.
[25] A. Staikopoulos, O. Cliffe, R. Popescu, J. Padget, S. Clarke, Template-Based Adaptation of Semantic Web Services with Model-Driven Engineering, IEEE Transactions on Services Computing 3 (2010) 116–130.
[26] Q. He, J. Yan, H. Jin, Y. Yang, Quality-Aware Service Selection for Service-Based Systems Based on Iterative Multi-Attribute Combinatorial Auction, IEEE Transactions on Software Engineering 40 (2014) 192–215.
[27] S. Wang, Y. Ma, B. Cheng, F. Yang, R. Chang, Multi-Dimensional QoS Prediction for Service Recommendations, IEEE Transactions on Services Computing, doi:10.1109/TSC.2016.2584058.
[28] Q. He, J. Han, Y. Yang, H. Jin, J. Schneider, S. Versteeg, Formulating Cost-Effective Monitoring Strategies for Service-Based Systems, IEEE Transactions on Software Engineering 40 (2014) 461–482.
[29] C. Bettini, D. Maggiorini, D. Riboni, Distributed Context Monitoring for the Adaptation of Continuous Services, World Wide Web 10 (2007) 503–528.
[30] K. S. Candan, W. S. Li, T. Phan, M. Zhou, Frontiers in Information and Software as Services, in: Proc. of the 25th International Conference on Data Engineering, 2009, pp. 1761–1768.
[31] C. Verbowski, E. Kiciman, A. Kumar, B. Daniels, S. Lu, J. Lee, Y. Wang, R. Roussev, Flight Data Recorder: Monitoring Persistent-State Interactions to Improve Systems Management, in: Proc. of the 7th Symp. Operating Systems Design and Implementation (OSDI 2006), 2006, pp. 117–130.
[32] G. Heward, J. Han, I. Muller, J. Schneider, S. Versteeg, Optimizing the Configuration of Web Service Monitors, in: Proc. of the 8th International Conference on Service-Oriented Computing (ICSOC 2010), 2010, pp. 587–595.
[33] G. Heward, I. Muller, J. Han, J. Schneider, S. Versteeg, Assessing the Performance Impact of Service Monitoring, in: Proc. of the 21st Australian Software Engineering Conference, 2010, pp. 192–201.
[34] H. Foster, G. Spanoudakis, Advanced Service Monitoring Configurations with SLA Decomposition and Selection, in: Proc. of ACM Symp. Applied Computing (SAC 2011), 2011, pp. 1582–1589.
[35] S. Meng, L. Liu, Enhanced Monitoring-as-a-Service for Effective Cloud Management, IEEE Transactions on Computers 62 (2013) 1705–1720.
[36] H. Zhang, R. Trapero, J. Luna, N. Suri, deQAM: A Dependency Based Indirect Monitoring Approach for Cloud Services, in: Proc. of the 2017 IEEE International Conference on Services Computing (SCC), 2017, pp. 27–34.
[37] Y. Wang, Q. He, D. Ye, Y. Yang, Formulating Criticality-based Cost-effective Monitoring Strategies for Multi-tenant Service-based Systems, in: Proc. of the 2017 IEEE International Conference on Web Services (ICWS), 2017, pp. 325–332.
[38] F. Wuhib, R. Stadler, A. Clemm, Decentralized Service-Level Monitoring Using Network Threshold Crossing Alerts, IEEE Communications Magazine 44 (10) (2006) 70–76.
[39] M. J. Wooldridge, N. R. Jennings, Intelligent Agents: Theory and Practice, Knowledge Engineering Review 10 (1995) 115–152.
[40] A. Keller, H. Ludwig, The WSLA Framework: Specifying and Monitoring Service Level Agreements for Web Services, Journal of Network and Systems Management 11 (2003) 57–81.
[41] Q. Wang, J. Shao, F. Deng, Y. Liu, M. Li, J. Han, H. Mei, An Online Monitoring Approach for Web Service Requirements, IEEE Transactions on Services Computing 2 (2009) 338–351.


[42] F. Raimondi, J. Skene, W. Emmerich, Efficient Online Monitoring of Web-Service SLAs, in: Proc. of the 16th ACM SIGSOFT Int'l Symp. Foundations of Software Eng., 2008, pp. 170–180.
[43] W. N. Robinson, Monitoring Web Service Requirements, in: Proc. of the 11th Int'l Conf. Requirements Eng., 2003, pp. 65–74.
[44] F. Chong, G. Carraro, R. Wolter, Multi-Tenant Data Architecture (2006). URL https://msdn.microsoft.com/en-us/library/aa479086.aspx
[45] J. Fiaidhi, I. Bojanova, J. Zhang, L. J. Zhang, Enforcing Multitenancy for Cloud Computing Environments, IT Professional 14 (2012) 16–18.
[46] J. M. A. Calero, J. G. Aguado, MonPaaS: An Adaptive Monitoring Platform as a Service for Cloud Computing Infrastructures and Services, IEEE Transactions on Services Computing 8 (1) (2015) 65–78.
[47] M. Dilman, D. Raz, Efficient Reactive Monitoring, in: Proc. of IEEE INFOCOM, 2001, pp. 1012–1019.
[48] R. Keralapura, G. Gormode, J. Ramamirtham, Communication-Efficient Distributed Monitoring of Thresholded Counts, in: Proc. of ACM SIGMOD Int'l Conf. Management of Data, 2006.
[49] S. Agrawal, S. Deb, K. V. M. Naidu, R. Rastogi, Efficient Detection of Distributed Constraint Violations, in: Proc. of Int'l Conf. Data Eng. (ICDE), 2007, pp. 1320–1324.
[50] P. Zhang, C. Lin, X. Ma, F. Ren, W. Li, Monitoring-Based Task Scheduling in Large-Scale SaaS Cloud, in: Proc. of the 15th International Conference on Service-Oriented Computing (ICSOC 2016), Alberta, Canada, 2016, pp. 140–156.
[51] V. W. Chu, R. K. Wong, S. Fong, C. Chi, Emerging Service Orchestration Discovery and Monitoring, IEEE Transactions on Services Computing 10 (2017) 889–901.
[52] X. Liang, Y. Xiao, Game Theory for Network Security, IEEE Communications Surveys and Tutorials 15 (1) (2013) 472–486.
[53] O. Shehory, K. Sycara, P. Chalasani, S. Jha, Agent Cloning: An Approach to Agent Mobility and Resource Allocation, IEEE Communications Magazine 36 (1998) 58–67.
[54] D. Ye, M. Zhang, D. Sutanto, Cloning, Resource Exchange and Relation Adaptation: An Integrative Self-organisation Mechanism in a Distributed Agent Network, IEEE Transactions on Parallel and Distributed Systems 25 (2014) 887–897.
[55] L. P. Kaelbling, M. L. Littman, A. W. Moore, Reinforcement Learning: A Survey, Journal of AI Research 4 (1996) 237–285.
[56] C. J. C. H. Watkins, Q-Learning, Machine Learning 8 (3) (1992) 279–292.
[57] D. Ye, M. Zhang, A Self-adaptive Sleep/Wake-up Scheduling Approach for Wireless Sensor Networks, IEEE Transactions on Cybernetics, doi:10.1109/TCYB.2017.2669996.
[58] S. Singh, M. Kearns, Y. Mansour, Nash Convergence of Gradient Dynamics in General-Sum Games, in: Proc. of UAI, 2000, pp. 541–548.
[59] D. Koller, N. Friedman, Probabilistic Graphical Models, MIT Press, 2009.
[60] H. Anton, I. Bivens, S. Davis, Calculus, Wiley, 2012.
[61] L. Wolsey, Integer Programming, Wiley-Interscience, 1998.
[62] C. Modi, D. Patel, B. Borisaniya, H. Patel, A. Patel, M. Rajarajan, A Survey of Intrusion Detection Techniques in Cloud, Journal of Network and Computer Applications 36 (1) (2013) 42–57.
[63] E. Al-Masri, Q. H. Mahmoud, Investigating Web Services on the World Wide Web, in: Proc. of the 17th International Conference on World Wide Web (WWW 2008), 2008, pp. 795–804.


Dr Dayong Ye received his MSc and PhD degrees from the University of Wollongong, Australia, in 2009 and 2013, respectively. He is now a research fellow in computer science at Swinburne University of Technology, Australia. His research interests focus on service-oriented computing, self-organisation and multi-agent systems.

Dr Qiang He received his first PhD degree in information and communication technology from Swinburne University of Technology (SUT), Australia, in 2009. He is now a lecturer at Swinburne University of Technology. His research interests include software engineering, cloud computing and services computing.

Mr. Yanchun Wang received his M.Eng. degree in computer science from Beihang University, Beijing, China, in 2006. He is currently a Ph.D. student at Swinburne University of Technology, Australia. His research interests include service computing and cloud computing.

Prof. Yun Yang received the PhD degree from the University of Queensland, Australia, in 1992. He is a full professor at Swinburne University of Technology. His research interests include software technologies, cloud computing, workflow systems and service-oriented computing.

Research Highlights

- We propose an agent-based decentralised service monitoring approach in distributed service-based systems.
- This approach is decentralised and does not need global information.
- This approach takes multi-tenancy into account.
- This approach can achieve good performance with low cost.