BigTrustScheduling: Trust-aware big data task scheduling approach in cloud computing environments


PII: S0167-739X(19)30405-4
DOI: https://doi.org/10.1016/j.future.2019.11.019
Reference: FUTURE 5291

To appear in: Future Generation Computer Systems

Received date: 11 February 2019
Revised date: 30 October 2019
Accepted date: 16 November 2019

Please cite this article as: G. Rjoub, J. Bentahar and O.A. Wahab, BigTrustScheduling: Trust-aware big data task scheduling approach in cloud computing environments, Future Generation Computer Systems (2019), doi: https://doi.org/10.1016/j.future.2019.11.019.

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2019 Published by Elsevier B.V.

Gaith Rjoub (a,1), Jamal Bentahar (a,2,*), Omar Abdel Wahab (b,3)

a Concordia Institute for Information Systems Engineering, Concordia University, Montreal, Canada
b Department of Computer Science and Engineering, Université du Québec en Outaouais, Gatineau, Canada

Abstract

Big data task scheduling in cloud computing environments has gained considerable attention in the past few years, due to the exponential growth in the number of businesses that are relying on cloud-based infrastructure as a backbone for big data storage and analytics. The main challenge in scheduling big data services in cloud-based environments is to guarantee a minimal makespan while minimizing, at the same time, the amount of utilized resources. Several approaches have been proposed in an attempt to overcome this challenge. The main limitation of these approaches stems from the fact that they overlook the trust levels of the Virtual Machines (VMs), thus risking to endanger the overall Quality of Service (QoS) of the big data analytic process, which includes not only heartbeat frequency ratio and resource consumption, but also security challenges such as intrusion detection, access control, authentication, etc. To overcome this limitation, we propose in this work a trust-aware scheduling solution called BigTrustScheduling that consists of three stages: VMs' trust level computation, tasks' priority level determination, and trust-aware scheduling. Experiments conducted on a real Hadoop cluster environment, using real-world datasets collected from the Google Cloud Platform pricing and from Bitbrains task and resource requirements, show that our solution minimizes the makespan by 59% compared to the Shortest Job First (SJF), by 48% compared to the Round Robin (RR), and by 40% compared to the improved Particle Swarm Optimization (PSO) approaches in the presence of untrusted VMs. Moreover, our solution decreases the monetary cost by 58% compared to the SJF, by 47% compared to the RR, and by 38% compared to the improved PSO in the presence of untrusted VMs. The results in this work can be applicable to other problems; this would be possible through tuning the corresponding metrics in the formulation of the problem and solution, as well as in the experimental environment. In fact, the trust model can be extended to other environments including cloud computing, IoT, parallel computing, etc.

* Corresponding author. Tel.: +1 514 848 2424; fax: +1 514 848 3171
1 Email: g [email protected]
2 Email: [email protected]
3 Email: [email protected]

Preprint submitted to Journal of Future Generation Computer Systems, October 29, 2019

Keywords: Task scheduling, big data, cloud computing, trust


1. Introduction

The term big data has arisen in the past few years as a building block concept, owing to the increase in the volumes of data that are generated by different businesses on a daily basis. This raised the need to adopt up-to-date solutions that could enable the storage, scheduling, and processing of large volumes of data. Cloud computing has been the number one choice for most of the businesses seeking to keep up with the big data evolution trends, thanks to the wide variety of advantages it offers such as multi-tenancy, elasticity, and virtualization. Thus, instead of purchasing and maintaining expensive hardware equipment to store and analyze big data, companies can now migrate these responsibilities to the cloud to reduce the capital and operational expenditures and enjoy improved performance. This, however, entails a serious problem for cloud administrators, i.e., how to efficiently schedule big data tasks in virtualized environments so as to guarantee better performance and minimal resource wastage.

1.1. Problem Statement

Several approaches have been proposed to improve the task scheduling process in cloud environments. The main idea of these approaches is to reduce the makespan by trying to reduce the waiting time of tasks in the queues and attempting to map tasks to the nearest VMs to reduce the transfer time (i.e., the data locality concept). However, this does not always guarantee a better performance in a dynamic and open environment like cloud computing. Specifically, with the large numbers of deployed VMs in cloud-based systems, the chance of encountering untrusted or poorly performing VMs is fairly high. Such VMs would not only cause the makespan to be high, but would also increase the amount of wasted resources in the case of malicious or compromised VMs.

Consider a case where we have three VMs, VM1, VM2, and VM3, and an incoming big data task whose waiting time is 30 seconds on VM1, 35 seconds on VM2, and 25 seconds on VM3. Suppose also that the trust values of the three VMs are 0.8 for VM1, 0.78 for VM2, and 0.56 for VM3. By applying the existing approaches, the task will be scheduled onto VM3, which gives the least waiting time for the task. However, the makespan of this task execution would be 60 + 25 = 85 seconds, where 60 is the execution time of the task on VM3. On the other hand, if we apply our solution on the same scheduling problem, VM1, which enjoys the highest trust value, would be picked to serve the underlying task. Given that this VM is highly trustworthy, its execution time would be small, e.g., 30 seconds. Consequently, the overall makespan of this task execution on that VM would be 30 + 30 = 60 seconds, which is significantly smaller than that of VM3.

Moreover, most of the existing cloud-based scheduling solutions consider the scheduling problem only from the tasks' perspective, by setting a priority degree for each task to determine the order in which the task will be scheduled and executed. Nevertheless, by focusing only on the task priorities and ignoring the reliability of the VMs, there would still be a large risk to the overall Quality of Service (QoS) in the presence of untrusted VMs. For example, a high-priority task might be assigned to the nearest VM to minimize the transfer time. However, such a VM might undergo some failure, thus causing a huge spike in both the total makespan and monetary cost. Therefore, a more intelligent approach which takes into account all these factors is needed.

1.2. Contributions

Fig. 1 shows the design of BigTrustScheduling, our trust-aware scheduling model for big data tasks in cloud computing environments. The proposed solution consists of three stages: (1) VMs' trust level computation; (2) tasks' priority level determination; and (3) trust-aware scheduling. These stages are executed sequentially to minimize the hardware and network resources cost while decreasing the response time delivered to customers.

The first stage comprises three phases, namely: heartbeat-based initial trust value computation, statistics-based trust computation, and Markov Chain Monte Carlo Gibbs Sampler (MCMC)-based trust aggregation and VMs ranking. The objective of the first phase is to derive initial trust values for the VMs based on the heartbeats' average response time and average frequency ratio. Specifically, we adopt the concept of JobTracker from the Hadoop platform [1], whose role is to periodically collect messages (commonly called heartbeats) from tasktracker agents deployed on VMs to ensure that these VMs are still alive [2]. To further improve the accuracy of the trust establishment process, we propose in the second phase an Interquartile Range (IQR) statistical technique (Tukey method) [3] in which the cloud system continually monitors the CPU, RAM, bandwidth, and disk storage consumption to capture any abnormal behavior (i.e., over-utilization or under-utilization). IQR has been chosen for the considered problem thanks to its robustness to outliers and its low overhead, which boosts its adoption in resource-constrained environments [4]. In the third phase, we employ the MCMC Gibbs sampler to aggregate the trust observations collected from the first and the second phases and come up with final trust values for each VM. The power of the Gibbs sampler comes from its ability to break down the high-dimensional parameter space into several low-dimensional phases, thus facilitating the convergence to the optimal distribution [5].
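The Gibbs-sampling aggregation step can be illustrated with a small sketch. The model below is not the authors' exact formulation: it simply treats the trust observations collected for one VM as draws from a normal distribution and runs a textbook Gibbs sampler over the unknown mean and variance, returning the posterior mean of the mean as the aggregated trust value. The priors, hyperparameters, and observation values are illustrative assumptions.

```python
import random

def gibbs_trust(observations, iters=2000, burn_in=500, seed=42):
    """Aggregate trust observations for one VM with a Gibbs sampler.

    Semi-conjugate normal model: mu ~ N(mu0, tau0^2), sigma^2 ~ InvGamma(a0, b0).
    Alternately samples mu | sigma^2 and sigma^2 | mu, then returns the
    posterior mean of mu as the aggregated trust value.
    """
    rng = random.Random(seed)
    n = len(observations)
    ybar = sum(observations) / n
    mu0, tau0_sq = 0.5, 1.0   # vague prior centred mid-scale (assumption)
    a0, b0 = 2.0, 0.01        # inverse-gamma hyperparameters (assumption)

    mu, sigma_sq = ybar, 0.1  # initial state of the chain
    kept = []
    for it in range(iters):
        # mu | sigma^2, data  ~  Normal(m_n, v_n)
        v_n = 1.0 / (1.0 / tau0_sq + n / sigma_sq)
        m_n = v_n * (mu0 / tau0_sq + n * ybar / sigma_sq)
        mu = rng.gauss(m_n, v_n ** 0.5)
        # sigma^2 | mu, data  ~  InvGamma(a0 + n/2, b0 + 0.5 * sum((y - mu)^2))
        a_n = a0 + n / 2.0
        b_n = b0 + 0.5 * sum((y - mu) ** 2 for y in observations)
        sigma_sq = b_n / rng.gammavariate(a_n, 1.0)  # reciprocal of a Gamma draw
        if it >= burn_in:
            kept.append(mu)
    return sum(kept) / len(kept)

# Trust observations for one VM from the heartbeat and IQR phases (made-up values).
obs = [0.82, 0.79, 0.85, 0.80, 0.78, 0.83]
final_trust = gibbs_trust(obs)
```

Breaking the two-dimensional posterior into the two one-dimensional conditional draws above is exactly the dimension-reduction property the text attributes to the Gibbs sampler.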

The second stage comprises three phases, namely: percentile-based task clustering, machine learning-based task clustering, and a multi-criteria decision-making approach for task priority level determination. The first phase clusters the tasks at hand on the basis of their hardware, network, and storage requirements by means of a statistical technique. Note that our choice of the percentile-based method for clustering tasks by their resource requirements stems from the sparse nature of our data (4), which limits the performance of K-means (which highly depends on the average) and results in a high number of outliers. Thereafter, in the second phase, we employ the K-means learning method to cluster the tasks according to their execution costs. The strength of K-means lies in its reasonable computational overhead for high-dimensional data compared to other clustering techniques (e.g., K-medoids, hierarchical clustering, etc.) [6]. Moreover, K-means has been chosen in this step over the percentile method since our aim is to generate five clusters (i.e., very low, low, medium, high, and very high), whereas the percentile method usually works well only for generating a small number of clusters (i.e., three). On top of the two aforementioned clustering processes, we design in the third phase a multi-criteria decision-making approach to determine the priority level of each task.

4 http://gwa.ewi.tudelft.nl/datasets/gwa-t-12-bitbrains

Figure 1: The three stages of our proposed solution (BigTrustScheduling). [Figure: stage one monitors heartbeat response times and VM behavior (Tukey method), producing initial trust values that are aggregated and ranked with the MCMC-Gibbs sampler into a queue of VMs (VM 1 ... VM n) sorted by trust; stage two clusters tasks by resource requirements (percentile method) and by cost (K-means algorithm), then ranks them via multi-criteria decision making into a queue of tasks (Task 1 ... Task m) sorted by priority; stage three performs the trust-aware scheduling of the two queues.]
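The cost-clustering phase of stage two can be sketched with a plain one-dimensional Lloyd's (K-means) algorithm. This is a generic illustration rather than the authors' implementation; the five-level labels follow the paper's very low to very high scheme, but the cost values are made up.

```python
def kmeans_1d(values, k=5, iters=100):
    """Plain Lloyd's algorithm on scalar task costs."""
    svals = sorted(values)
    # Spread the initial centroids across the sorted data.
    centroids = [svals[int(i * (len(svals) - 1) / (k - 1))] for i in range(k)]
    for _ in range(iters):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            idx = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[idx].append(v)
        # Update step: move each centroid to its cluster mean.
        new = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    return centroids, clusters

# Hourly execution costs (USD) for a batch of tasks; illustrative numbers only.
costs = [0.01, 0.012, 0.05, 0.055, 0.20, 0.21, 0.80, 0.85, 3.1, 3.3]
centroids, clusters = kmeans_1d(costs, k=5)
labels = ["very low", "low", "medium", "high", "very high"]
ranking = {labels[i]: sorted(c) for i, c in enumerate(clusters)}
```

Because the centroids are initialized in sorted order, the resulting clusters map directly onto the five cost levels without a relabeling pass.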

Having determined the trust level of each VM and the priority level of each task, we propose in the final stage a trust-aware scheduling approach to match each task to the corresponding VM. Such a trust-aware scheduling approach can be integrated as a complementary step into state-of-the-art big data schedulers such as YARN [7] and MESOS [8]. In summary, the main contributions of this paper are:

• Proposing a trust-aware scheduling mechanism to increase the performance of big data services execution, which leads to minimizing the makespan and the execution cost. This is achieved by terminating the untrusted VMs and assigning high-priority tasks to the most trusted VMs. To the best of our knowledge, this is the first work that advances a comprehensive trust-aware and performance-oriented scheduling approach for big data services on the cloud.

• Designing a trust establishment framework which combines performance information related to the average heartbeat response time, the average heartbeat frequency ratio, and the VMs' resource utilization to derive trust values for each VM.

• Developing two clustering techniques, from machine learning and statistics, applied to real data to group tasks into a set of clusters based on their resource requirements and execution costs.

• Designing a multi-criteria decision-making mechanism to determine the priority level of each task.

The rest of the paper is organized as follows. In Section 2, we conduct a literature review on the existing cloud-oriented scheduling approaches and stress the original contributions of this work. In Section 3, we formally formulate the trust-aware scheduling problem and give an overview of our solution. In Section 4, we provide a detailed description of our solution, highlight its different stages and phases, and present the corresponding algorithms. In Section 5, we explain the experimental environment used to evaluate the performance of our solution and present an empirical analysis against three existing scheduling approaches. Finally, in Section 6, we conclude the paper and highlight some possible extensions of this work.

2. Related work

In this section, we classify the existing task scheduling approaches in the domain of cloud computing into three major classes: artificial intelligence-based scheduling, trust-based scheduling, and other approaches.

2.1. Artificial Intelligence (AI)-Based Scheduling

In [9], the authors address the problem of achieving a tradeoff between minimizing energy consumption in cloud centres and minimizing the makespan of tasks. To do so, they design a multi-objective optimization problem based on the Dynamic Voltage Frequency Scaling System and employ the Non-dominated Sorting Genetic Algorithm (NSGA-II) to obtain feasible solutions. Moreover, they advance an Artificial Neural Network (ANN) technique to predict the corresponding VMs based on the tasks' characteristics and resources' features. The authors of [10] propose a hybrid task scheduling algorithm named FMPSO that is based on the Fuzzy system and Modified Particle Swarm Optimization technique to enhance the load balancing and throughput of cloud systems. FMPSO first considers four modified velocity updating methods and a selection technique to enhance the global search capability. Then, it uses the cross-over and mutation operators of genetic algorithms to overcome some of PSO's drawbacks. Finally, the proposed approach applies fuzzy inference to compute the fitness values. In [11], the authors design a multi-objective optimization problem which seeks to minimize three conflicting objectives, i.e., execution cost, makespan, and resource utilization. The Epsilon-fuzzy dominance based composite discrete artificial bee colony (EDCABC) approach is then employed to derive Pareto optimal solutions for multi-objective task scheduling in cloud-based systems. In [12], the authors propose a novel algorithm named MGGS (modified genetic algorithm (GA) combined with greedy strategy), which leverages the modified GA combined with a greedy strategy to optimize the task scheduling process. In [13], the authors propose a scheduling approach called Multi Label Classifier Chains Swarm Intelligence (MLCCSI) whose aim is to reduce the makespan of task scheduling. The approach consists of two strategies. The first strategy applies the Ant Colony Optimization (ACO), Artificial Bee Colony (ABC), and Particle Swarm Optimization (PSO) algorithms to derive the optimal solution in terms of resource allocation for each corresponding task. In the second strategy, a machine learning technique is employed to predict the best algorithm (i.e., ACO, ABC, or PSO) to run for each corresponding task, based on several factors such as task size and number of VMs.

The authors of [14] propose a Multi-Agent based Cloud Monitoring (MAS-CM) model to boost the performance and security of task gathering, scheduling, and execution in large-scale cloud systems. The main objective of this work is to prevent unauthorized task injection and alteration, while optimizing the scheduling process and maximizing resource utilization. In [15], the authors propose an improved version of the Particle Swarm Optimization (PSO) approach to improve the efficiency of resource management and reduce task makespan in cloud-based task scheduling scenarios. The main idea is to change the weights of the particles as the number of iterations grows and to introduce random weights in the final stages. This helps avoid the case where the PSO algorithm generates local optimum solutions in its final stages. The authors compare their approach experimentally with the traditional PSO, Shortest Job First (SJF), Round-Robin (RR), and First Come First Serve (FCFS) approaches and show that their approach can reduce the makespan as the number of tasks increases. In this work, we compare our solution with this approach, along with the SJF and RR approaches, to demonstrate the effectiveness of our solution compared to the state-of-the-art, especially in harsh environments that consist of large numbers of tasks and large numbers of untrusted VMs.

In summary, these approaches employ AI-based techniques to improve the scheduling process. In contrast, in our work, AI (K-means in our case) is coupled with many other techniques (i.e., trust modeling, MCMC-based trust aggregation, and statistical analysis) to provide a comprehensive view of the task scheduling problem and improve the performance of big data task scheduling in a wide variety of scenarios, i.e., large numbers of VMs, large numbers of tasks, and the presence of untrusted VMs.

2.2. Trust-Based Scheduling

In [16], the authors propose a scheduling model based on three factors, namely cost, time, and trust. The main objective is to ameliorate users' satisfaction using trust feedback data and risk cost estimation. Wang et al. [17] propose a Bayesian approach for a cognitive trust model which relies on direct and indirect trust sources to derive trust values for cloud resources. Thereafter, a trust-based dynamic level scheduling algorithm called Cloud-DLS is advanced to minimize time costs and ensure a secure execution of tasks. In [18], the authors propose to integrate trust into workflow execution in cloud environments. They first investigate the trust relationship between users and cloud resources and then derive a trust-based directed acyclic graph (DAG) scheduling algorithm. In [19], the authors propose a trust-aware adaptive DAG task scheduling algorithm using reinforcement learning and the Monte Carlo Tree Search (MCTS) method. By employing the scheduling state space, action space, and reward function, the authors design a reinforcement learning model to train the policy gradient-based reinforcement agent. Moreover, the MCTS method can determine actual scheduling policies when DAG tasks are simultaneously executed on trusted and untrusted entities. The authors of [20] focus on the challenge of achieving secure scheduling of users' requests. They propose a trust model which includes both direct trust and indirect reputation metrics. The paper focuses on assuring a trustworthy execution environment in the cloud. Based on the designed trust model, users' requests are scheduled onto the appropriate resources using a trust-based stochastic scheduling algorithm. In summary, the existing trust-based scheduling approaches focus on improving the security of the cloud environment by relying on feedback collected from users. Unlike these approaches, we propose in this paper BigTrustScheduling, a comprehensive trust framework which involves several statistical and machine learning techniques to determine the poorly performing VMs and improve the QoS delivered to users.

2.3. Other Scheduling Approaches

To take advantage of the availability and abundance of cloud computing resources, several scheduling approaches have been proposed [21, 22, 23]. A scheduling algorithm called Linear Scheduling for Tasks and Resources (LSTR) is designed in [24]. The algorithm focuses on the scheduling of cloud resources among the task requesters, with the aim of improving the system throughput and resource utilization.

In [25], the authors propose a batch-based (DAG) scheduling algorithm, which combines centralized static scheduling and dynamic workflow to improve resource utilization and achieve quasi-optimal throughput on heterogeneous platforms. Maguluri et al. [26] propose several virtual machine configuration scheduling algorithms for cloud computing platforms that reach an arbitrary fraction of cloud capacity. To do so, they employ a stochastic model of a cloud computing cluster in which tasks are performed based on a stochastic process. They also introduce a set of policies for configuring non-pre-emptive frame-based virtual machines.

In [27], Zhang et al. propose a two-stage scheduling strategy to maximize task scheduling performance and minimize unreasonable task allocation in cloud centres. In the first phase, a Bayes classifier is employed to cluster tasks according to historical scheduling data and correspondingly create suitable VMs. In the second phase, dynamic scheduling algorithms are proposed to match tasks with the created VMs.

Despite their importance, the above-mentioned approaches overlook the trust levels of the VMs when performing the scheduling process. This might lead to assigning tasks to some poorly performing VMs, thus increasing the makespan of tasks and causing resource wastage on the cloud providers' side.

Furthermore, the existing cloud-based scheduling approaches consider the scheduling problem either from the tasks' perspective or the VMs' perspective. This is done either by setting a priority degree for each task to determine the order in which the task will be scheduled and executed, or by evaluating the QoS of each VM to determine its appropriateness for serving tasks. Nevertheless, by focusing only on the task priorities and ignoring the reliability of the VMs, there would still be a large risk to the overall Quality of Service (QoS) in the presence of untrusted VMs. On the other hand, by focusing only on the VMs' side, tasks' priority levels and deadline constraints would be ignored, which would menace customers' satisfaction. To tackle these limitations, we propose in this work a multidisciplinary solution that employs techniques from trust modeling, machine learning, statistics, and decision-making to improve the performance of big data task scheduling over the cloud, while considering the metrics and constraints of both the underlying tasks and cloud systems.

3. Problem Definition

Let Y = {y1, y2, ..., yq} be a finite set of jobtrackers, where each jobtracker yi ∈ Y monitors and manages a set of tasktrackers S = {s1, s2, ..., sx} responsible for a set V = {v1, v2, ..., vm} of VMs, and let vj be a typical VM in the set. Every tasktracker monitors and analyzes the CPU, RAM, disk storage, and network bandwidth utilization of each vj ∈ V. As a first stage, the jobtracker seeks to establish trust relationships toward the tasktrackers based on the average frequency and response time of the heartbeats. This allows the cloud system to build an initial trust value for each vj, denoted as InitialTrust_j. To do so, the cloud system collects log files from jobtrackers which contain information on the behavior of the tasktrackers (i.e., heartbeat information) to learn about the active and inactive ones. In order to complete the final trust decision, the cloud system monitors the VMs found trusted in the initial trust establishment phase to inspect their CPU, RAM, network bandwidth, and disk storage utilization, and applies the Interquartile Range (IQR) statistical measure to identify any abnormal utilization on each vj and hence compute another initial trust score for each VM. Note that c(vj) represents the CPU consumption of vj, r(vj) the RAM consumption, s(vj) the disk storage consumption, and b(vj) the bandwidth consumption of vj. Finally, the MCMC Gibbs Sampler method is employed to aggregate both initial trust values, come up with a final trust score for each VM, and rank these VMs based on their trust scores.



Figure 2: Flowchart of the scheduling process. [Flowchart: arriving big data tasks are clustered by resource requirements using the statistical percentile method (Algorithm 4; 3 clusters) and by execution cost using the K-means clustering technique (Equation (8); 5 clusters); task priority levels are then computed with the multi-criteria decision-making technique (Equation (10)) and the tasks are ranked in a queue T in descending order of priority. In parallel, each VM's initial trust values are computed from the heartbeats' average frequency ratio (Equation (4), FirstInitialTrust) and the IQR statistical technique (Algorithm 3, SecondInitialTrust), aggregated into a final trust value with the Gibbs sampler technique (Equation (7)), and the VMs are ranked in a queue V in descending order of trust. The scheduling algorithm (Algorithm 5) takes the first task t_i from T and the first VM v_j from V; if the trust value of v_j does not exceed the trust threshold Ω, v_j is terminated.]


Having computed the trust value of each VM and ranked them based on their trustworthiness, the next step is to determine the priority level of each task. Let T = {t1, t2, ..., tn} be a set of tasks submitted by certain users and let ti be a typical task in the set. Moreover, let R_i^z represent the set of requirements of each task ti in terms of parameter z (CPU, RAM, bandwidth, and disk storage). We first propose a new method based on percentile ranking to cluster the tasks according to their CPU, RAM, network bandwidth, and disk storage requirements R_i^z. Thereafter, we propose a K-means clustering technique to cluster tasks based on their monetary prices. Finally, we advocate a multi-criteria decision-making approach which capitalizes on the two aforementioned clustering methods to determine the priority level of each task and rank these tasks based on their priority scores.

Having ranked VMs based on their trust scores and tasks based on their priority levels, the final step is to execute the actual scheduling process. To do so, we propose a new scheduling approach which takes into consideration the trust values of the VMs, the priority levels of the tasks, and the resource requirements and availabilities to match each task to the corresponding VM, so as to minimize the makespan and monetary cost. Note that the different notations used throughout the paper are summarized in Table 1 and the model diagram is shown in Fig. 2.
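The final matching step can be sketched by greedily pairing the two ranked queues. This is not the paper's scheduling algorithm, just a simplified stand-in under the assumption that each task needs a fixed amount of free CPU on its host VM; every name and number below is hypothetical.

```python
OMEGA = 0.6  # trust threshold (illustrative)

# VMs ranked by trust and tasks ranked by priority; all values hypothetical.
vms = [
    {"name": "vm1", "trust": 0.86, "free_cpu": 4},
    {"name": "vm2", "trust": 0.74, "free_cpu": 2},
    {"name": "vm3", "trust": 0.41, "free_cpu": 8},  # below the threshold
]
tasks = [
    {"name": "t1", "priority": 0.9, "cpu": 2},
    {"name": "t2", "priority": 0.7, "cpu": 2},
    {"name": "t3", "priority": 0.4, "cpu": 2},
]

# Terminate untrusted VMs, then match the queues front to front.
alive = [vm for vm in vms if vm["trust"] > OMEGA]
alive.sort(key=lambda vm: vm["trust"], reverse=True)
tasks.sort(key=lambda t: t["priority"], reverse=True)

assignment = {}
for task in tasks:
    for vm in alive:
        if vm["free_cpu"] >= task["cpu"]:
            assignment[task["name"]] = vm["name"]
            vm["free_cpu"] -= task["cpu"]
            break
```

The highest-priority tasks therefore land on the most trusted VMs with available capacity, which is the behavior the makespan example in Section 1.1 motivates.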

4. BigTrustScheduling: Description of the Proposed Trust-Aware Scheduling Approach

4.1. First Stage: Building Trust on VMs

4.1.1. First Phase: Building Initial Trust on TaskTracker


Scheduling decisions are made by a master node called the jobtracker, while the worker nodes, called tasktrackers, are responsible for task execution. The jobtracker maintains a queue of the currently running jobs, the states in a tasktracker cluster, and the list of tasks assigned to each tasktracker. Each tasktracker periodically reports its status (active or inactive) to the jobtracker in the form of a heartbeat message, as shown in Fig. 3. The content of the heartbeat message is:

• Lists of completed or failed tasks.

• A progress report of the tasks currently being carried out at the issuer tasktracker.

• A Boolean flag (accept new tasks) that indicates whether the sender jobtracker can assign an additional task or not. This indicator is set to true if the number of

Table 1: Notations

Y : Set of jobtrackers.
S : Set of tasktrackers.
V : Set of virtual machines.
T : Set of tasks provided by users.
z : A performance parameter z ∈ M = {CPU, RAM, bandwidth, disk storage}.
ϕ_j : Set of heartbeats provided by a tasktracker for the VM v_j.
Φ_j : Average heartbeat response time for the VM v_j.
InitialTrust_j : Initial trust believed by the cloud system in the trustworthiness of the virtual machine v_j.
ψ_j : Habitual heartbeat response time from the historical log file for VM v_j.
Q_j : Heartbeat response frequency ratio for VM v_j.
γ_j : Historical average heartbeat frequency for VM v_j.
W_j : Average response time ratio for VM v_j.
η_j^x : Response time of heartbeat x for VM v_j.
τ_h : Time moment h ∈ N.
P_i : Priority degree of task t_i ∈ T.
χ_{V,τ} : Matrix storing the trust values of all VMs V across time moments τ.
e_{jh} : Trust value of VM v_j at time moment τ_h.
µ : Set of trust mean values of VM v_j across time moments.
a_r : Resource requirement cluster r ∈ {1, 2, 3} for low, medium, and high, resp.
c_u : Task cost cluster u ∈ {1, ..., 5} for very low, low, medium, high, and very high, resp.
D_z(T) : A (3 × 5)-matrix of requirements for task t_i ∈ T in terms of the performance parameter z.
G_{ru} : Position of a particular task belonging to the clusters a_r and c_u in the matrix D_z(T) (M_{ru} = D_z(T)[r, u]).
CL : Hadoop cluster.
RK_d^{t_i} : Rack that contains at least one VM that stores the data chunk d pertaining to task t_i.
ϖ : File system's block size (e.g., 128 MB in the case of Hadoop 2.0).
v_d^{t_i} : The VM on which the data chunk d pertaining to task t_i is stored.
F : Big data file that needs to be stored.
|F| : Size of file F.
Ω : Trust threshold according to which a certain VM is considered to be trusted or not.


running tasks at the tasktracker is below the VM's capacity. The jobtracker collects the heartbeats received from the set of tasktrackers installed on each VM. If a heartbeat is not received from a tasktracker within a certain interval of time, the VM is deemed dead or down. Hence, to calculate the heartbeat response frequency ratio Q_j, let γ_j be the habitual historical average heartbeat frequency for VM v_j, and ϕ_j be the current heartbeat frequency for VM v_j:

Q_j = (ϕ_j / γ_j) × 100%    (1)

and the average heartbeat response time Φ_j would be:

Φ_j = (∑_{x=1}^{N_j} η_j^x) / N_j    (2)

where η_j^x represents the response time of the heartbeat x for the VM v_j ∈ V, and N_j is the total number of received heartbeats for v_j. Then, the average response time ratio can be calculated as follows:

W_j = (Φ_j / ψ_j) × 100%    (3)

where ψ_j is the habitual heartbeat response time of VM v_j from the historical log file.

re-

Finally, the first trust value of VM v_j (FirstInitialTrust_j) is computed by dividing the sum of the heartbeat response frequency ratio and the average response time ratio by two, since we assume that both ratios have equal importance:

FirstInitialTrust_j = (Q_j + W_j) / 2    (4)

The second trust value of VM v_j (SecondInitialTrust_j) will be computed later in Algorithm 3.
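For illustration, the heartbeat-based computation of Equations (1)-(4) can be sketched as follows (a minimal sketch; the function and variable names are ours and not part of the proposed system):

```python
def first_initial_trust(response_times, current_frequency,
                        historical_frequency, habitual_response_time):
    """FirstInitialTrust_j for one VM, following Eqs. (1)-(4)."""
    # Eq. (1): heartbeat response frequency ratio Q_j (in percent)
    q = current_frequency / historical_frequency * 100.0
    # Eq. (2): average heartbeat response time Phi_j over N_j heartbeats
    phi = sum(response_times) / len(response_times)
    # Eq. (3): average response time ratio W_j (in percent)
    w = phi / habitual_response_time * 100.0
    # Eq. (4): both ratios are assumed to have equal importance
    return (q + w) / 2.0

# A VM that currently sends 9 heartbeats per window against a historical
# average of 10, and answers slightly faster than its habitual 1.25 s:
trust = first_initial_trust([0.9, 1.0, 1.1], 9.0, 10.0, 1.25)  # ~85.0
```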

Figure 3: Tasks assignment and reporting process

4.1.2. Second Phase: VMs Monitoring

In this phase, the tasktracker monitors the CPU, RAM, network bandwidth, and disk storage consumption of the VMs to construct a consumption dataset for each VM. It then employs the IQR statistical technique to recognize any abnormal consumption. The IQR is a measure of variability whose basic idea is to divide a given set of data into disjoint quartiles (i.e., Q1, Q2, Q3). The first quartile Q1 is the value below which 25% of the values in the dataset fall. The second quartile Q2 represents the median value of the dataset. The third quartile Q3 is the value above which 25% of the values fall. The IQR is obtained by subtracting the first quartile from the third quartile. The reasons for choosing the IQR measure for our problem are (1) its robustness to disordered and unorganized data and (2) its simple and light nature that does not require significant computational effort.

Algorithms 1, 2, and 3 (executed by the jobtracker) describe the aforementioned process. In particular, Algorithm 1 determines the VMs whose consumption exceeds the normal maximal consumption, and Algorithm 2 determines the VMs whose consumption falls below the normal minimal habitual consumption (e.g., failed VMs). Based on the results obtained from Algorithm 1 and Algorithm 2, Algorithm 3 computes the initial trust of the VMs. After having collected the CPU, memory, disk storage, and bandwidth consumption for each VM, the jobtracker calculates the median usage of each VM for each corresponding metric (e.g., CPU) (Algorithms 1 and 2 - lines 15 and 16 respectively). Then, based on the calculated median value, the first and third quartiles Q1 and Q3, respectively, are derived for each metric (Algorithms 1 and 2 - lines 16-17 respectively). Using these quartiles, the IQR is calculated by subtracting Q1 from Q3 and multiplying the result by 1.5 (Algorithms 1 and 2 - lines 18 and 19 respectively). Intuitively, this means that any value lying more than one and a half times the IQR above the upper quartile, or below the lower quartile, shall be considered an outlier according to Tukey's analysis [28].

Algorithm 1 VMs Upper Consumption Monitoring
Inputs:
1: v_j: a VM being monitored by the cloud system
2: M = {CPU, memory, disk storage, bandwidth}: the set of VM metrics to be analyzed by the cloud system
3: δ: size of the time window after which the algorithm is to be repeated
Variables:
4: M_j^z(t): a table recording the amount of each metric z ∈ M consumed by v_j during the time interval [t − δ, t]
5: x̄_j^z(t): the median consumption of z ∈ M by v_j during the time interval [t − δ, t]
6: Q1_j^z(t): the 1st quartile consumption of z ∈ M by v_j during the time interval [t − δ, t]
7: Q3_j^z(t): the 3rd quartile consumption of z ∈ M by v_j during the time interval [t − δ, t]
8: IQR_j^z(t): the IQR consumption of z ∈ M by v_j during the time interval [t − δ, t]
9: L_j^z(t): the upper consumption limit of z ∈ M by v_j during the time interval [t − δ, t]
10: OverUse_j^z: sum of VM v_j's unusual high consumption of z ∈ M (initialized to 0)
11: CountOverUse_j^z: a counter enumerating the occurrences of unusual high consumption of z ∈ M by v_j (initialized to 0)
12: AvgOverUse_j^z: VM v_j's average unusual high consumption of z ∈ M
Outputs:
13: PropOverUse_j^z: VM v_j's unusual consumption of z ∈ M proportionally to the upper consumption limit of this z
14: |OverusedMetrics_j|: the number of metrics that v_j over consumed, such that |OverusedMetrics_j| ≤ |M|
15: for each metric z ∈ M do
16:   Compute the median x̄_j^z(t) of M_j^z(t)
17:   Find Q1_j^z(t) as the median of M_j^z(t)'s lower half and Q3_j^z(t) as the median of M_j^z(t)'s upper half
18:   Compute IQR_j^z(t) = (Q3_j^z(t) − Q1_j^z(t)) × 1.5
19:   Compute L_j^z(t) = IQR_j^z(t) + Q3_j^z(t)
20:   for each data point y ∈ M_j^z(t + δ) do
21:     if y > L_j^z(t) then
22:       OverUse_j^z = OverUse_j^z + y
23:       CountOverUse_j^z = CountOverUse_j^z + 1
24:     end if
25:     if CountOverUse_j^z > 0 then
26:       AvgOverUse_j^z = OverUse_j^z / CountOverUse_j^z
27:       PropOverUse_j^z = L_j^z(t) / AvgOverUse_j^z
28:       |OverusedMetrics_j| = |OverusedMetrics_j| + 1
29:     end if
30:   end for
31: end for
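The per-metric computations of Algorithms 1 and 2 can be condensed into a short sketch (our own illustrative code; quartiles are taken as the medians of the sorted data's lower and upper halves, as in the algorithms):

```python
def median(xs):
    """Median of a non-empty sequence."""
    s = sorted(xs)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2.0

def tukey_limits(window):
    """Lower and upper consumption limits (W_j^z, L_j^z) for one window."""
    data = sorted(window)
    half = len(data) // 2
    q1 = median(data[:half])        # median of the lower half
    q3 = median(data[-half:])       # median of the upper half
    iqr = (q3 - q1) * 1.5           # 1.5 x interquartile range
    return q1 - iqr, q3 + iqr

def unusual_consumption(window, next_window):
    """Points of the next window falling outside the habitual limits."""
    low, high = tukey_limits(window)
    under = [y for y in next_window if y < low]
    over = [y for y in next_window if y > high]
    return under, over

# CPU usage (%) of a VM in [t - delta, t], then in [t, t + delta]:
under, over = unusual_consumption(
    [10, 10, 11, 11, 12, 12, 12, 13], [9, 12, 20])
# over == [20]: only the 20% spike exceeds the upper limit
```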


Algorithm 2 VMs Lower Consumption Monitoring
Inputs:
1: v_j: a VM being monitored by the cloud system
2: M = {CPU, memory, disk storage, bandwidth}: the set of VM metrics to be analyzed by the cloud system
3: δ: size of the time window after which the algorithm is to be repeated
Variables:
4: M_j^z(t): a table recording the amount of each metric z ∈ M consumed by v_j during the time interval [t − δ, t]
5: x̄_j^z(t): the median consumption of z ∈ M by v_j during the time interval [t − δ, t]
6: Q1_j^z(t): the 1st quartile consumption of z ∈ M by v_j during the time interval [t − δ, t]
7: Q3_j^z(t): the 3rd quartile consumption of z ∈ M by v_j during the time interval [t − δ, t]
8: IQR_j^z(t): the IQR consumption of z ∈ M by v_j during the time interval [t − δ, t]
9: W_j^z(t): the lower consumption limit of z ∈ M by v_j during the time interval [t − δ, t]
10: UnderUse_j^z: sum of VM v_j's unusual low consumption of z ∈ M (initialized to 0)
11: CountUnderUse_j^z: a counter enumerating the occurrences of unusual low consumption of z ∈ M by v_j (initialized to 0)
12: AvgUnderUse_j^z: VM v_j's average unusual low consumption of z ∈ M
Outputs:
13: PropUnderUse_j^z: VM v_j's unusual consumption of z ∈ M proportionally to the lower consumption limit of this z
14: |UnderusedMetrics_j|: the number of metrics that v_j under consumed, such that |UnderusedMetrics_j| ≤ |M|
15: for each metric z ∈ M do
16:   Compute the median x̄_j^z(t) of M_j^z(t)
17:   Find Q1_j^z(t) as the median of M_j^z(t)'s lower half and Q3_j^z(t) as the median of M_j^z(t)'s upper half
18:   Compute IQR_j^z(t) = (Q3_j^z(t) − Q1_j^z(t)) × 1.5
19:   Compute W_j^z(t) = Q1_j^z(t) − IQR_j^z(t)
20:   for each data point y ∈ M_j^z(t + δ) do
21:     if y < W_j^z(t) then
22:       UnderUse_j^z = UnderUse_j^z + y
23:       CountUnderUse_j^z = CountUnderUse_j^z + 1
24:     end if
25:     if CountUnderUse_j^z > 0 then
26:       AvgUnderUse_j^z = UnderUse_j^z / CountUnderUse_j^z
27:       PropUnderUse_j^z = W_j^z(t) / AvgUnderUse_j^z
28:       |UnderusedMetrics_j| = |UnderusedMetrics_j| + 1
29:     end if
30:   end for
31: end for


Algorithm 3 Virtual Machines Monitoring
Inputs:
1: v_j: a VM being monitored by the cloud system
2: M = {CPU, memory, disk storage, bandwidth}: the set of VM metrics to be analyzed by the cloud system
3: δ: size of the time window after which the algorithm is to be repeated
Variables:
4: M_j^z(t): a table recording the amount of each metric z ∈ M consumed by v_j during the time interval [t − δ, t]
5: PropOverUse_j^z: VM v_j's unusual consumption of z ∈ M (obtained from Algorithm 1) proportionally to the upper consumption limit of this z
6: |OverusedMetrics_j|: the number of metrics that v_j over consumed (obtained from Algorithm 1), such that |OverusedMetrics_j| ≤ |M|
7: PropUnderUse_j^z: VM v_j's unusual consumption of z ∈ M (obtained from Algorithm 2) proportionally to the lower consumption limit of this z
8: |UnderusedMetrics_j|: the number of metrics that v_j under consumed (obtained from Algorithm 2), such that |UnderusedMetrics_j| ≤ |M|
9: UpperInitialTrust_j: the initial trust of the cloud system regarding v_j's trustworthiness with respect to the upper limit
10: LowerInitialTrust_j: the initial trust of the cloud system regarding v_j's trustworthiness with respect to the lower limit
Outputs:
11: SecondInitialTrust_j: the initial trust of the cloud system in VM v_j's trustworthiness
12: procedure VMMonitoring
13:   repeat
14:     run Algorithm 1 on v_j, M, δ to obtain PropOverUse_j^z and |OverusedMetrics_j|
15:     if |OverusedMetrics_j| = 0 then UpperInitialTrust_j = 1
16:     else
17:       UpperInitialTrust_j = (∑_{z∈M} PropOverUse_j^z / |OverusedMetrics_j|) × 100%
18:     end if
19:     run Algorithm 2 on v_j, M, δ to obtain PropUnderUse_j^z and |UnderusedMetrics_j|
20:     if |UnderusedMetrics_j| = 0 then LowerInitialTrust_j = 1
21:     else
22:       LowerInitialTrust_j = (∑_{z∈M} PropUnderUse_j^z / |UnderusedMetrics_j|) × 100%
23:     end if
24:     SecondInitialTrust_j = (UpperInitialTrust_j + LowerInitialTrust_j) / 2
25:   until δ elapses
26: end procedure


By adding the IQR to the third quartile, the jobtracker computes the upper consumption limit for each underlying metric (Algorithm 1 - line 19). Similarly, by subtracting the IQR from Q1, the jobtracker determines the lower consumption limit for each underlying metric (Algorithm 2 - line 19). These limits describe the patterns of maximal and minimal habitual utilization of each VM in a certain time interval, where any future utilization greater/lower than the upper/lower limits would be considered unusual. The jobtracker then checks the future consumption of the VM at time t + δ to determine whether there exists any consumption that exceeds the computed upper limit (Algorithm 1 - lines 20-21). If so, this result is added to a table that registers the VM's unusual high consumption (Algorithm 1 - line 22), and the average unusual consumption for each metric is then computed (Algorithm 1 - line 26). At the same time t + δ, the jobtracker also checks for any consumption that falls below the computed lower limit (Algorithm 2 - lines 20-21). If found, these values are added to a table that stores the VM's unusual low consumption (Algorithm 2 - line 22), and the average unusual consumption for each metric is computed (Algorithm 2 - line 26). Eventually, the jobtracker computes its initial trust in each VM's trustworthiness by dividing the sum of the average unusual consumptions over all the metrics by the number of metrics that the VM has overused (Algorithm 3 - line 17) or underused (Algorithm 3 - line 22), if any. If no metric has been overused or underused, the corresponding initial trust in the VM's trustworthiness is set to 1 (Algorithm 3 - lines 15 and 20, respectively). Then, by computing the sum of UpperInitialTrust_j and LowerInitialTrust_j and dividing it by two (Algorithm 3 - line 24), we obtain the aggregate trust of each VM. Note finally that the whole process is repeated periodically after a certain period of time δ to continuously capture the dynamism in the VMs' performance and behavior.
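The aggregation performed by Algorithm 3 reduces to a few lines (an illustrative sketch; `over_props` and `under_props` stand for the PropOverUse and PropUnderUse values delivered by Algorithms 1 and 2):

```python
def side_trust(props):
    """Trust with respect to one limit: 1 if no metric was unusual,
    otherwise the mean of the per-metric proportions."""
    return 1.0 if not props else sum(props) / len(props)

def second_initial_trust(over_props, under_props):
    """SecondInitialTrust_j = (UpperInitialTrust_j + LowerInitialTrust_j) / 2."""
    return (side_trust(over_props) + side_trust(under_props)) / 2.0

# A VM that overused two metrics but never underused any:
t = second_initial_trust([0.8, 0.6], [])  # (0.7 + 1.0) / 2, i.e. ~0.85
```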

4.1.3. Third Phase: Rank Aggregation via MCMC Gibbs Samples

In this section, we use Gibbs sampling to rank the VMs based on their trust values.

Jo

The inputs of this step are the initial trust values obtained from the monitoring algorithms (Algorithms 1, 2, and 3) and the trust values obtained from the heartbeat phase. The average of both trust sources represents the samples at every time moment used to

350

find the likelihood distribution, and we used the last time moment to represent the 21

Journal Pre-proof

prior distribution. Using this information, the objective of the gibbs sampling phase is to come up with the posterior or aggregate trust value of each VM at the current

pro of

time moment (at which we are interested in computing the trust values). Thus, trust is calculated dynamically based on the initial trust values obtained from the monitoring 355

algorithms and the trust values obtained using the heartbeat data. Based on the data collected in phases one and two, each VM can be classified as either trusted or not.

We will be using MCMC methods in a Bayesian framework. Historically, Bayesian statistics has been quite theoretical: until about twenty years ago, it had not been possible to solve practical problems through the Bayesian approach due to the

intractability of the integrations involved5 . In this work, we capitalize on the Bayesian

re-

approach to combine our prior trusted values with the data collected and produce new posterior trust values for each VM. Note that the methodology followed to compute the aggregate trust values for the VMs is inspired by [31].

Let χV,τ be a matrix storing the trust values for all VMs V = {v1 , v2 , . . . , vm } across

365

time moments τ = {τ1 , τ2 , . . . , τn }. We use j  j0 to denote that VM v j is ranked

lP

higher than VM v j0 , which means that v j is considered to be more trusted. Let e jh be the trusted value of VM v j at time moment τh . Furthermore, any e jh having a highly trusted value would have a small position index in the matrix χV,τ , i.e. if e jh > e j0 h ,

this means that j  j0 . For clarification purposes, in the following discussion we use index j to denote the identifier of the VM and index h to denote the corresponding time

urn a

370

moment at which the rank is computed, with m and n denoting the total numbers of VMs being ranked and time moments, respectively. In our work, each ranked VM is associated with relevant covariates (CPU, RAM, bandwidth, and disk storage) that are used as a reference to determine the ranks of the VMs as per Algorithm 3. Table 2 375

shows the trust matrix χ_V,τ of VMs V in different time moments τ from Algorithm 3.

5. Interested readers are advised to consult the references [29, 30] for further details.


Table 2: Trust lists of VMs in different time moments (the matrix χ_V,τ)

VMs \ Time moments    τ_1     τ_2     τ_3     . . .    τ_n
v_1                   e_11    e_12    e_13    . . .    e_1n
v_2                   e_21    e_22    e_23    . . .    e_2n
v_3                   e_31    e_32    e_33    . . .    e_3n
. . .                 . . .   . . .   . . .   . . .    . . .
v_m                   e_m1    e_m2    e_m3    . . .    e_mn

In Table 2, e_jh represents the trust value of VM v_j ∈ V from the matrix χ_V,τ, where j ∈ {1, 2, ..., m}, at time moment h, and e_jh > e_j′h if and only if j ≻ j′. The set µ =


(µ1 , ..., µm ) represents the trust mean values of each VM across time moments. Let λ

be the trust ranking list for all VMs, where these VMs are sorted in descending order on the basis of their trust values. The trust ranking list λ should satisfy the following conditions:

e_jh = µ_j + ε_jh,   ε_jh ∼ N(0, σ²),   (1 ≤ h ≤ n; 1 ≤ j ≤ m)    (5)

where ε_jh is a small noise value and the ε_jh for each VM v_j and time moment h are jointly independent. Thus, the trust values of the VMs at the current time moment are modeled by Equation (5).

urn a

Since we only observe the trust ranking lists λ's, multiplying all the µ_j's by a constant or adding a constant to them does not influence the likelihood function, and e_{j_1 h} > e_{j_2 h} if and only if v_{j_1} ≻ v_{j_2} for the time moment h. The rank model in Equation (5) supposes that the τ's are independent and identically distributed (i.i.d.) conditionally on µ. Hence, the likelihood function becomes:



p(τ_1, ..., τ_{n−1} | µ_j) = ∏_{h=1}^{n−1} p(τ_h | µ_j) = ∏_{h=1}^{n−1} ∫_{ℝ^n} p(τ_h | e_jh, µ_j) p(e_jh | µ_j) de_jh    (6)

where p(τ_h | e_jh, µ_j) = 1. Our goal is to generate an aggregated rank based on an

estimate of µ j in Equation 6. We can employ a Bayesian method, which is more



convenient to incorporate prior information, to quantify estimation uncertainties, and to utilize efficient Markov chain Monte Carlo (MCMC) algorithms. With a reasonable


prior value, the posterior mean of µ_j would also be a consistent estimator under the same setting as in Equation (6). Let p(e_jn) denote the prior probability. The posterior distribution of µ_j and (e_j1, ..., e_jn) is given in Equation (7):

p(µ_j, e_j1, ..., e_jn | τ_1, ..., τ_n) = p(e_jn) · ∏_{h=1}^{n−1} p(e_jh | µ_j) · ∏_{h=1}^{n−1} 1{τ_h = rank(e_jh)}    (7)

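The Bayesian update behind Equation (7) can be illustrated with a drastically simplified, self-contained sketch: a conjugate normal model for one VM's mean trust µ_j, with the previous time moment acting as the prior (the paper samples over rankings instead; all names below are ours):

```python
import random

def posterior_mean_trust(samples, prior_mean, prior_var, noise_var=0.01,
                         n_iter=2000, seed=0):
    """Monte Carlo estimate of the posterior mean of mu_j.

    Simplified sketch: the trust observations e_j1..e_jn are modeled as
    N(mu_j, noise_var), and mu_j carries a N(prior_mean, prior_var) prior.
    """
    rng = random.Random(seed)
    n = len(samples)
    # Conjugate normal-normal posterior for mu_j
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mean = post_var * (prior_mean / prior_var + sum(samples) / noise_var)
    # Draw repeatedly from the posterior, as one coordinate of a Gibbs
    # sweep would, and average the draws
    draws = [rng.gauss(post_mean, post_var ** 0.5) for _ in range(n_iter)]
    return sum(draws) / len(draws)
```

VMs can then be ranked by their estimated posterior means to obtain the aggregated trust ranking list.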

Having aggregated the trust values, we can now proceed with reducing the complexity of our solution by terminating the VMs that have low trust values, which means low performance, for instance dead or slow-to-respond VMs. This is done by comparing the trust value of each VM with a trust factor (threshold) Ω that is specified by the designer. We dynamically change the input trust factor Ω to measure two

metrics: the makespan and monetary cost. In Fig. 4, we vary the value of the trust

lP

threshold Ω to explore its impact on the two metrics. For this experiment, we fix the number of deployed VMs to 50 and the number of tasks to 7, 884. The results clearly show that there is a trade-off between the trust threshold and both the makespan and cost. For a smaller value of the threshold, the makespan and cost are somehow high, 400

since there would be bigger chances of choosing less trusted VMs for storing the data.

urn a

The second main observation is that when Ω exceed 60%, the decrease in the makespan and cost becomes monotone. Therefore, we can conclude that a trust threshold of 60% would be suitable for Algorithm 5 according to the considered scenario since beyond this value, more VMs would be deployed and more resources would be used for a negligible decrease in the two metrics.

Jo

405

24

Journal Pre-proof

Figure 4: Comparing different Ω for makespan and monetary cost after terminating the untrusted VMs (x-axis: trust factor Ω from 0.1 to 1; left y-axis: average makespan in seconds; right y-axis: average cost in $)

4.2. Second Stage: Task Clustering

4.2.1. First Phase: Percentile Method for Task Clustering Based on their Resource Requirements

Having computed the trust scores of the VMs and ranked them based on their trust values, we are interested in this section in clustering the tasks on the basis of their resource requirements. To do so, we take advantage of the percentile method [32], which is used to partition the observations in a certain dataset into percentiles (e.g., the 10th percentile) based on their values. The task clustering process is depicted in Algorithm 4. The first step in the algorithm is to sort the resource requirements of the tasks for each corresponding resource metric (i.e., CPU, RAM, storage, and bandwidth) (Step 10 in Algorithm 4). Then, we compute the lower rank (Step 11 in Algorithm 4) and higher rank (Step 12 in Algorithm 4) indices. The lower rank index is the index below which 25% of the observations fall, while the higher rank index is the index above which 25% of the observations fall. These rank indices are then split into integer (i.e., Rank_Integer) and fractional (Rank_Fraction) parts. For example, the integer part of 3.4 would be 3 and the fractional part would be 0.4. Next, we determine the observations in the sorted dataset that correspond to Rank_Integer and Rank_Integer + 1 (Steps 13-16 in Algorithm 4) and save them into the variables

elementvalue and elementPlusOne respectively (Steps 17-20 in Algorithm 4) . The lower 425

and higher percentile values are then computed by interpolating between elementvalue

pro of

and elementPlusOne values according to the RankFraction . Finally, the actual clustering steps are executed by assigning the tasks whose resource requirements fall under the lower percentile value to the low requirement cluster klow and the tasks whose resource requirements fall above the higher percentile value to the high requirement cluster khigh 430

(Steps 24-27 in Algorithm 4). The remaining values are automatically assigned to the

Jo

urn a

lP

re-

medium requirement percentile kmedium (Step 28-30 in Algorithm 4).


Algorithm 4 Task Clustering
Inputs:
1: T: set of tasks, where t_i ∈ T
2: n: size of the dataset containing the requirements
3: M = {CPU, memory, disk storage, bandwidth}: the set of VM metrics to be analyzed by the cloud system
4: R_T^z: the requirements in terms of metric z ∈ M for the set T of tasks
5: R_i^z: the requirements in terms of metric z ∈ M for task t_i ∈ T
Output:
6: Clusters k_high^z, k_medium^z, k_low^z, for each z ∈ M
7: procedure TaskClustering
8: for each metric z ∈ M do
9:   Initialize k_low^z, k_medium^z, and k_high^z to empty clusters
10:  Sort R_T^z
11:  Compute Rank1 E_1(R_T^z) such that E_1(R_T^z) = 0.25 × (n − 1) + 1
12:  Compute Rank2 E_2(R_T^z) such that E_2(R_T^z) = 0.75 × (n − 1) + 1
13:  Rank_Integer1 = ⌊E_1(R_T^z)⌋
14:  Rank_Integer2 = ⌊E_2(R_T^z)⌋
15:  Rank_Fraction1 = E_1(R_T^z) − ⌊E_1(R_T^z)⌋
16:  Rank_Fraction2 = E_2(R_T^z) − ⌊E_2(R_T^z)⌋
17:  element_value1 = R_T^z[Rank_Integer1]
18:  element_value2 = R_T^z[Rank_Integer2]
19:  element_PlusOne1 = R_T^z[Rank_Integer1 + 1]
20:  element_PlusOne2 = R_T^z[Rank_Integer2 + 1]
21:  Compute the lower percentile value: lower_percentile = element_value1 + Rank_Fraction1 × (element_PlusOne1 − element_value1)
22:  Compute the higher percentile value: higher_percentile = element_value2 + Rank_Fraction2 × (element_PlusOne2 − element_value2)
23:  for each t_i ∈ T do
24:    if R_i^z ≤ lower_percentile then
25:      Assign t_i to k_low^z
26:    else if R_i^z ≥ higher_percentile then
27:      Assign t_i to k_high^z
28:    else
29:      Assign t_i to k_medium^z
30:    end if
31:  end for
32: end for
33: end procedure
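The interpolation steps of Algorithm 4 correspond to the classical rank-based percentile estimate; a compact sketch for one metric (our own illustrative code):

```python
def percentile(sorted_vals, p):
    """Rank-based percentile with linear interpolation, as in Algorithm 4:
    fractional rank E = p * (n - 1) + 1 over a 1-indexed sorted list."""
    n = len(sorted_vals)
    rank = p * (n - 1) + 1          # 1-based fractional rank
    integer = int(rank)             # integer part (floor)
    fraction = rank - integer       # fractional part
    value = sorted_vals[integer - 1]
    if fraction == 0 or integer >= n:
        return value
    nxt = sorted_vals[integer]      # element at Rank_Integer + 1
    return value + fraction * (nxt - value)

def cluster_tasks(requirements):
    """Split a {task: requirement} mapping into low/medium/high clusters."""
    vals = sorted(requirements.values())
    low_cut = percentile(vals, 0.25)
    high_cut = percentile(vals, 0.75)
    k_low = {t for t, r in requirements.items() if r <= low_cut}
    k_high = {t for t, r in requirements.items() if r >= high_cut}
    k_medium = set(requirements) - k_low - k_high
    return k_low, k_medium, k_high
```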


4.2.2. Second Phase: K-Means Clustering

K-means is one of the most widely adopted unsupervised learning techniques. It aims at clustering a set of observations in a certain dataset based on their degree of similarity. In this work, we employ K-means to group the tasks into five different clusters (i.e., very low, low, average, high, and very high) based on their monetary costs. K-means operates in an iterative manner by assigning each data observation to one of the K (in our case 5) groups based on a feature similarity criterion (the feature is the cost in our case). Technically speaking, the K-means algorithm starts by defining K centroids, each pertaining to a given cluster. These centroids are chosen in such a way as to be the farthest possible from one another. Then, each data observation is assigned to the nearest centroid according to the Euclidean distance, and the mean of all the observations in each cluster is computed and becomes the new centroid of the underlying cluster. This process keeps repeating until the same points are assigned to each cluster in consecutive rounds.

Let X = {x_j | j = 1, 2, . . . , n} be the set of n d-dimensional points to be clustered into a set of K clusters C = {c_k | k = 1, 2, . . . , K}. Mathematically, the K-means algorithm seeks to minimize the squared error function given by:

J(C) = ∑_{k=1}^{K} ∑_{x_j ∈ c_k} ‖x_j − µ_k‖²    (8)

where ‖x_j − µ_k‖² is the chosen distance measure between a data point x_j and the mean value µ_k of cluster c_k, summed over all the data points and their respective cluster centers.

urn a

where (k x j − µk k)2 is a chosen distance measure between a data point x j and the mean

28

pro of

Journal Pre-proof

(b) Tasks costs based RAM

lP

re-

(a) Tasks costs based CPU

(d) Tasks costs based DS

urn a

(c) Tasks costs based BW

Figure 5: Clustering based on monetary cost (iteration one)

In Fig. 5, we show the output yielded after applying the K-means clustering technique on the cost values of our tasks, which were computed according to the Google Cloud pricing list (footnote 6) for the CPU, RAM, bandwidth, and disk storage. By observing Fig. 5,

we can notice that the cost range for the very high cluster (in blue) is between 50$ and 91$ for the CPU, 510$ and 850$ for the RAM, 125$ and 150$ for the bandwidth, and between 440$ and 830$ for the disk storage. The cost range for the high cluster (in


green) is between 30$ and 42$ for the CPU, between 360$ and 500$ for the RAM, 6 https://cloud.google.com/pricing/list

29

Journal Pre-proof

between 95$ and 125$ for the bandwidth, and between 240$ and 415$ for the disk storage. The cost range for the medium cluster (in red) is between 21$ and 29$ for the CPU, between 205$ and 340$ for the RAM, between 54$ and 97$ for the bandwidth,

pro of


and between 90$ and 230$ for the disk storage. The cost range for the low cluster (in yellow) is between 13$ and 21$ for the CPU, between 80$ and 205$ for the RAM, between 37$ and 68$ for the bandwidth, and between 40$ and 49$ for the disk storage. Finally, the cost range for the very low cluster (in orange) is between 5$ and 12$ for 465

the CPU, between 2$ and 47$ for the RAM, between 10$ and 36$ for the bandwidth, and between 10$ and 30$ for the disk storage.

Our K-means clustering process is done online every time the tasks to be sched-

re-

uled are determined, which captures the dynamics of tasks’ properties. In order to study the performance of this clustering approach under a dynamic changing environ470

ment in terms of these properties, we show in Fig. 6 the clustering topology obtained after changing the tasks’ properties from one time moment to another to reflect the dynamism that is likely to be encountered in real cloud environments, where each iteration

lP

represents a particular time moment. These results show a different clustering topology from the first iteration to the second. This is because the cost of tasks changes across 475

iterations (i.e., time). Nonetheless, in both cases, our K-means clustering method could group all the tasks into the appropriate clusters without entailing any outlier data point.

urn a

Thus, we can conclude that the proposed K-means clustering approach is efficient and resilient in dynamic environments wherein the tasks’ properties change from one iter-

Jo

ation to another.

30

Journal Pre-proof

160 140

80

pro of

120 100

Price

Price

60 40

80 60 40

20

20

0

0

Cluster

(a) Tasks costs based CPU

(b) Tasks costs based RAM

160

600

re-

140 120

500

100

400

80

Price

Price

Cluster

60

300 200

40

100

lP

20 0

Cluster

(c) Tasks costs based BW

0

Cluster

(d) Tasks costs based DS

480

urn a

Figure 6: Clustering based on monetary cost (iteration two)

4.2.3. Third Phase: Multi-Criteria Task Priority Level Determination

Having clustered the tasks based on their resource requirements (Section 4.2.1) as well as their associated costs (Section 4.2.2), we are now ready to determine the priority level of each task. To this end, we adopt a multi-criteria decision-making technique [33] by following the subsequent steps:

• Determine the potential criteria (the task cost clusters derived in Section 4.2.2) and alternatives (the task requirement clusters derived in Section 4.2.1);

• Put the tasks in the decision matrix based on the clusters to which they belong;

• Compute the weight of each task, for each parameter (i.e., CPU, RAM, bandwidth, and disk storage), based on its position in the matrix; and

• Compute the overall priority of each task, over all parameters.

Formally, consider a priority determination problem with a set A = {a_1, a_2, a_3} of alternatives and a set C = {c_1, c_2, c_3, c_4, c_5} of criteria. In the set A, a_1 represents the low resource requirements cluster, a_2 represents the medium resource requirements cluster, and a_3 represents the high resource requirements cluster. In the set C, c_1 represents the very low task cost cluster, c_2 represents the low task cost cluster, c_3 represents the medium task cost cluster, c_4 represents the high task cost cluster, and c_5 represents the very high task cost cluster. The decision matrix representing the mapping between the set of potential criteria and the set of alternatives is depicted in Matrix (9).

D_z(T) =
           c_1    c_2    c_3    c_4    c_5
    a_1    G_11   G_12   G_13   G_14   G_15
    a_2    G_21   G_22   G_23   G_24   G_25      (9)
    a_3    G_31   G_32   G_33   G_34   G_35

In Matrix (9), each G_ru represents the position of the task belonging to cluster a_r and cluster c_u in the decision matrix. Using this matrix, the priority degree formula for a task t_i is given in Eq. (10):

p_i = (∑_{z∈M} G_ru^z / (4 × G_35)) × 100%    (10)

where G_ru^z is the position of the task t_i in the instantiation of the matrix D_z(T), z ∈ M,

and 4 represents the number of task resource parameters considered in this work. For example, assume that we have fifteen tasks T = {t1 , · · · ,t15 }. Then, we would have

four different instantiations of the decision matrix Dz (T ): Dc (T ), Dr (T ), Db (T ), and


Ds (T ) for the CPU, RAM, bandwidth, and disk storage respectively. Assume that the


D_c(T):
           c_1    c_2    c_3    c_4    c_5
    a_1    t_11   t_2    t_6    t_1    t_3
    a_2    t_15   t_5    t_14   t_13   t_12
    a_3    t_4    t_7    t_8    t_9    t_10

D_r(T):
           c_1    c_2    c_3    c_4    c_5
    a_1    t_13   t_9    t_5    t_1    t_2
    a_2    t_8    t_4    t_14   t_3    t_6
    a_3    t_12   t_11   t_10   t_15   t_7

D_b(T):
           c_1    c_2    c_3    c_4    c_5
    a_1    t_14   t_6    t_10   t_13   t_8
    a_2    t_5    t_12   t_4    t_15   t_9
    a_3    t_3    t_1    t_7    t_11   t_2

D_s(T):
           c_1    c_2    c_3    c_4    c_5
    a_1    t_11   t_13   t_15   t_5    t_3
    a_2    t_2    t_9    t_10   t_12   t_7
    a_3    t_4    t_1    t_6    t_8    t_14

Then, for the sake of simplicity and without loss of generality, we show in the following how the priority degrees of three random tasks t_1, t_7 and t_13 would be computed as per Eq. (10):

P_1 = ((4/15) + (4/15) + (6/15) + (12/15))/4 × 100% = (0.27 + 0.27 + 0.4 + 0.8)/4 × 100% = 44%

P_7 = ((12/15) + (15/15) + (6/15) + (10/15))/4 × 100% = (0.8 + 1 + 0.4 + 0.67)/4 × 100% = 72%

P_13 = ((9/15) + (1/15) + (4/15) + (2/15))/4 × 100% = (0.6 + 0.07 + 0.27 + 0.13)/4 × 100% = 27%
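The computations above follow directly from Eq. (10); a small sketch reproducing them (positions are 1-based and G_35 = 15 here; the paper rounds the intermediate ratios before reporting the percentages):

```python
def priority_degree(positions, g35=15):
    """Priority degree of a task from its positions in the four decision
    matrices (CPU, RAM, bandwidth, disk storage), following Eq. (10)."""
    return sum(positions) / (4 * g35) * 100.0

p1 = priority_degree([4, 4, 6, 12])    # ~43.3, reported as 44%
p7 = priority_degree([12, 15, 6, 10])  # ~71.7, reported as 72%
p13 = priority_degree([9, 1, 4, 2])    # ~26.7, reported as 27%
```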

Thus, we can conclude that task t_1 has a priority degree of 44%, task t_7 has a priority degree of 72%, and task t_13 has a priority degree of 27%. This means that, according to our scheduling algorithm described in Section 4.3, task t_7 should be scheduled first, followed by task t_1, and then task t_13. However, the variation of task execution costs across different computing units is not a significant factor in our case. In fact, the cost is highly impacted by the trust value of the VM that will be executing each underlying task. In other words, assigning a certain big data task to a highly trusted VM that will execute it in a timely fashion entails a lower cost than executing the task on a less trusted VM with a higher execution time. Thus, we reduce the task execution cost by reducing both the execution time and the waiting time of each task, regardless of the pricing differences across computing units.

the pricing differences across computing units.

4.3. Third Stage: Trust-Aware Task Scheduling Approach

Fig. 7 presents a schematic architecture of the proposed approach that clarifies how each (Hadoop) task is practically matched to the corresponding VMs. We build on the heartbeat information obtained from the tasktrackers about the performance of the VMs, as well as on the behavioral monitoring algorithms (Algorithms 1, 2, and 3), to propose a dynamic trust-aware scheduling approach that adapts its scheduling decisions to events occurring in the cloud environment. Algorithm 5 illustrates the scheduling process of the tasks on VMs. The algorithm takes as inputs the queue of tasks ranked based on their priority degrees (as explained in Section 4.2.3), the queue of VMs ranked based on their trust levels (as explained in Section 4.1.3), the VMs' specifications in terms of CPU, RAM, bandwidth, and disk storage, and the task resource requirements in terms of CPU, RAM, bandwidth, and disk storage.


[Figure 7 shows the proposed framework: customer tasks arrive at the Hadoop master node, whose scheduler draws on the cluster log files, the statistics-based trust computation (first trust value), the dynamic VM monitoring (second trust value), the task clusters based on resource requirements (percentile method) and on monetary costs (K-means), and the queue policies for VM trust values, task priorities, and scheduling roles, before dispatching the tasks to the Hadoop slave nodes N1, N2, N3, ..., Nx of the Hadoop cluster.]

Figure 7: Architecture of the proposed framework

The algorithm assigns each task having the highest priority degree to the VM that stores the data chunks on which the task needs to execute (i.e., data-local execution). If this VM does not have enough resources to serve the task, the task is scheduled on another VM belonging to the same rack as the VM that contains the chunks needed by the task (i.e., rack-local execution). Otherwise, the task is scheduled on any VM in the underlying Hadoop cluster.


Algorithm 5 Task Scheduling

Inputs:
  T: set of big data tasks, where ti ∈ T
  V: set of VMs, where vj ∈ V
  QT: queue of tasks ranked based on their priority values
  QV: queue of VMs ranked based on their trust values
  ej: trust value of VM vj
  CPUj(x), RAMj(x), BWj(x), DSj(x): CPU, RAM, bandwidth, and disk storage available on VM vj at time x
  Rci, Rri, Rsi, Rbi: the CPU, RAM, disk storage, and bandwidth requirements of task ti
  Ω: trust threshold according to which a certain VM is considered to be trusted or not
  F: big data file that needs to be stored; |F|: size of file F
  ϖ: file system's block size (e.g., 128 MB in the case of Hadoop 2.0)
  v_d(ti): the VM on which the data chunk d pertaining to task ti is stored
  RK_d(ti): the rack that contains at least one VM storing the data chunk d pertaining to task ti
  CL: Hadoop cluster

procedure TaskScheduling
  while CL is running do
    Get relevant information on the status of the machines in the cluster from the JobTrackers and TaskTrackers
    Compute the first initial trust values of the VMs using Eqs. (1)-(4), and the second initial trust values using Algorithms 1-3
    for each VM vj ∈ V do
      Use the MCMC Gibbs Sampling method to aggregate the first and second initial trust values and come up with a final trust score ej for VM vj
      if ej < Ω then
        terminate vj, i.e., V = V \ {vj}
      else
        add vj to QV
      end if
      Sort the VMs in QV in descending order, from the most trusted to the least trusted
      while there are still some data chunks of F that are not stored yet do
        hv = head(QV)
        store one of the |F|/ϖ chunks d of F on hv
        hv = hv.next
      end while
    end for
    Cluster the set T of big data tasks based on their resource requirements using Algorithm 4, and based on their monetary costs using K-means as per Section 4.2.2
    Determine the priority level of each task ti ∈ T using the multi-criteria task priority level determination method described in Section 4.2.3
    Sort the tasks in QT in descending order, from the most important to the least important
    for each ti ∈ QT do
      repeat
        if Rci ≤ CPU of v_d(ti) and Rri ≤ RAM of v_d(ti) and Rbi ≤ BW of v_d(ti) and Rsi ≤ DS of v_d(ti) then
          assign ti to v_d(ti) (i.e., data-local execution)
        else if Rci ≤ CPUvl and Rri ≤ RAMvl and Rbi ≤ BWvl and Rsi ≤ DSvl, where vl is any VM belonging to RK_d(ti) then
          move d to a VM vp on RK_d(ti)
          assign ti to vp (i.e., rack-local execution)
        else
          move d to a VM vq contained on another rack of cluster CL
          assign ti to vq (i.e., off-rack execution)
        end if
      until all tasks in T are assigned to VMs
    end for
  end while
end procedure
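The three placement branches of Algorithm 5 can be sketched as follows. This is an illustrative Python sketch, not the paper's implementation: the dictionary layout and the VM/task values (`free`, `rack`, `chunks`, and the resource numbers) are assumptions made for the example.

```python
def fits(task, vm):
    # A task fits a VM if every requirement (CPU, RAM, bandwidth,
    # disk storage) is within the VM's currently available resources.
    return all(task["req"][k] <= vm["free"][k]
               for k in ("cpu", "ram", "bw", "ds"))

def schedule(task, vms):
    """Data-local -> rack-local -> off-rack placement, as in Algorithm 5.
    `vms` maps a VM id to {"free": {...}, "rack": r, "chunks": set_of_ids}."""
    # 1) data-local: the VM that already holds the task's data chunk
    local = next(v for v in vms.values() if task["id"] in v["chunks"])
    if fits(task, local):
        return local, "data-local"
    # 2) rack-local: any fitting VM on the same rack (chunk moved within rack)
    for v in vms.values():
        if v is not local and v["rack"] == local["rack"] and fits(task, v):
            return v, "rack-local"
    # 3) off-rack: any fitting VM in the cluster (chunk moved across racks)
    for v in vms.values():
        if fits(task, v):
            return v, "off-rack"
    return None, "unschedulable"

# Hypothetical cluster state: v1 holds the chunk but lacks resources.
vms = {
    "v1": {"free": {"cpu": 1, "ram": 1, "bw": 1, "ds": 1}, "rack": 0, "chunks": {"t1"}},
    "v2": {"free": {"cpu": 4, "ram": 8, "bw": 2, "ds": 9}, "rack": 0, "chunks": set()},
    "v3": {"free": {"cpu": 8, "ram": 16, "bw": 4, "ds": 9}, "rack": 1, "chunks": set()},
}
task = {"id": "t1", "req": {"cpu": 2, "ram": 4, "bw": 1, "ds": 5}}
vm, mode = schedule(task, vms)
print(mode)  # rack-local: v1 holds the chunk but lacks resources; v2 fits
```

In the example, the data-local VM v1 cannot serve the task, so the scheduler falls back to v2 on the same rack, mirroring the second branch of the algorithm.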


We summarize in what follows the main steps of the scheduling algorithm:

• Calculate the trust value of every VM based on observations obtained from both the heartbeat data and the behavioral monitoring algorithms (Algorithms 1, 2, and 3).

• Compare the trust value of each VM with a trust threshold Ω. If the trust value of the VM falls below Ω, the VM is considered untrusted and is hence terminated, both to avoid assigning data and tasks to untrusted VMs (which include dead and slow-to-respond ones) and to reduce the input space of the scheduling process. Otherwise, the VM is added to a queue QV, which stores the VMs in descending order of their underlying trust values.

• Divide the input big data file into a set of equal chunks (e.g., 128 MB as per Hadoop 2.0) and distribute each chunk to the most trusted remaining VM in QV (the VM is removed from QV after being assigned a chunk, to avoid data duplication on the same VM). This process is repeated until all the chunks of the file are appropriately stored.

• Determine the priority level of each task using the insights obtained from both the percentile and K-means methods.

• Perform the actual task scheduling by matching the tasks to VMs based on their priority levels. The scheduling process considers three cases. In the first case, each task is assigned to the VM that locally stores the data chunks needed by the task (data-local execution). If this VM is busy, the data chunks are moved to any available VM on the same rack and the task is scheduled on that VM (rack-local execution). If neither data-local nor rack-local VMs are available, the data is moved to an off-rack VM on which the task is scheduled (off-rack execution).
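The percentile-based requirement clustering used in the fourth step can be sketched with Python's standard library; the CPU demands below are hypothetical stand-ins for the tasks' requirement vectors:

```python
import statistics

# Hypothetical CPU demands (in cores) for a batch of tasks.
cpu_demands = [1, 2, 2, 3, 4, 4, 5, 6, 8, 8, 9, 12]

# Quartile cut points split the demand distribution into four bands.
q1, q2, q3 = statistics.quantiles(cpu_demands, n=4)

def cluster(demand):
    # Bucket a task by the quartile of the demand distribution it falls in.
    if demand <= q1:
        return "low"
    if demand <= q2:
        return "medium"
    if demand <= q3:
        return "high"
    return "very-high"

clusters = {d: cluster(d) for d in sorted(set(cpu_demands))}
print(q1, q2, q3, clusters)
```

The same bucketing can be applied per resource dimension (CPU, RAM, bandwidth, disk storage) to form the requirement clusters fed into the priority computation.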


5. Experiments

5.1. Experimental Setup

To conduct our experiments, we employ a dataset collected by Bitbrains7, a service provider that is specialized in managed hosting and business computation for enterprises. The dataset consists of 1,750 VMs running in a distributed datacenter and serving up to 8,260 tasks. The information contained in the dataset includes: timestamps, number of provisioned virtual CPU cores, CPU capacity of each VM, CPU usage of each VM, amount of memory provisioned to each VM, memory usage per VM, disk read and write throughput per VM, and network received/transmitted throughput. We consider both trusted and untrusted VMs, and vary the percentage of untrusted VMs from 10% to 50%. Untrusted VMs are those that exhibit poor performance in serving customers' requests and consume abnormal amounts of resources (e.g., CPU, RAM, etc.). To implement the K-means clustering approach, we vary the number of clusters from 1 to 5 to determine the optimal number of clusters that minimizes the percentage of outliers. Based on the experimental results, we could conclude that 5 clusters achieve the best performance on the considered dataset.

The objective of the first set of experiments is to study the performance of our solution in terms of execution time and scheduling time. Minimizing these metrics is of prime importance to guarantee efficient execution of big data requests in realistic industrial environments. In the second set of experiments, we measure the monetary cost entailed by our approach for serving tasks. This is also an important factor for both providers and customers to help them save resources and money.

We compare our approach with three algorithms, i.e., SJF, RR, and improved PSO. The SJF approach considers only the size of tasks to derive the order of their execution. The main advantage of SJF lies in its inherent simplicity, and also in the fact that it minimizes the average amount of time each process has to wait until its execution is complete. The main limitations of this approach are that (1) it might entail a long waiting time for tasks requiring a long execution time if short tasks are constantly added, and (2) it ignores the trust values of the VMs in the scheduling process. The main purpose of the RR approach is to support time sharing among tasks on the VMs: time slices are assigned to each task in equal portions and in circular order, handling all tasks without priority. The main advantages of RR are that it is simple to implement and, contrary to SJF, starvation-free. The main disadvantage of RR is that its throughput largely depends on the choice of the time quantum's length. If the time quantum is chosen to be very large, RR behaves like the first-come-first-served method; if it is chosen to be very small, a considerable share of the time is spent on task switching, meaning lower throughput. The second limitation of RR is that it overlooks the trust values of the VMs when generating the scheduling decisions. The main idea of the improved PSO approach is to change the weights of the particles with the increase in the number of iterations and to inject some random weights in the final stages to avoid the case where the PSO algorithm gets stuck in local optima. The main advantage of improved PSO is that it involves no overlapping and no mutation calculation: the search is carried out through the speed of the particles and, over the generations, only the most optimistic particle transmits information to the other particles, so the search converges quickly. However, similar to SJF and RR, the improved PSO also ignores the trust of the VMs in the scheduling process, which might entail lower quality of service and higher execution costs.

7 http://gwa.ewi.tudelft.nl/datasets/gwa-t-12-bitbrains
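To make the SJF/RR trade-off described above concrete, the following toy simulation contrasts the two baselines on hypothetical burst times (all tasks arriving at time 0, unit quantum); it illustrates the general behavior only and is not tied to the paper's experiments:

```python
from collections import deque

def sjf_avg_wait(bursts):
    # Shortest jobs first; each job's waiting time is its start time.
    wait, clock = 0, 0
    for b in sorted(bursts):
        wait += clock
        clock += b
    return wait / len(bursts)

def rr_avg_wait(bursts, quantum=1):
    # Round-robin with a fixed quantum; waiting = completion time - burst.
    queue = deque(range(len(bursts)))
    remaining = list(bursts)
    clock, completion = 0, {}
    while queue:
        i = queue.popleft()
        run = min(quantum, remaining[i])
        clock += run
        remaining[i] -= run
        if remaining[i] == 0:
            completion[i] = clock
        else:
            queue.append(i)
    return sum(completion[i] - b for i, b in enumerate(bursts)) / len(bursts)

bursts = [3, 1, 2]
print(sjf_avg_wait(bursts), rr_avg_wait(bursts))
```

With bursts [3, 1, 2], SJF yields an average waiting time of 4/3 while RR with a unit quantum yields 7/3, reflecting SJF's optimality for mean waiting time when all jobs arrive together, and RR's extra waiting induced by quantum-sized passes.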


We implemented our solution in a real Hadoop cluster environment. In fact, we created our own Hadoop cluster, which consists of one primary master node, one secondary master node, and a variable number of slave nodes (i.e., 10, 15, 20, 25, 30, 35, 40, 45, 50). The created nodes have different characteristics in terms of CPU, RAM, disk storage, and bandwidth, to make them able to support different types of workloads. To evaluate the performance of our solution, we used real specifications of the different created nodes as inputs to our scheduling approach. We then ran our code to determine the trust values of the different nodes and the priority level of each corresponding task. Based on these results, we ran the finally selected trusted nodes (as chosen by our solution) on the Hadoop cluster, distributed the corresponding big data task files according to the nodes' trust values as explained in Algorithm 5, and finally assigned the big data tasks to the corresponding nodes on the basis of their priority levels. The experiments have been performed in a 64-bit Windows 7 environment on a machine equipped with an Intel Core i7-6700 CPU 3.40 GHz processor and 16 GB RAM. The implementation code is open source and available on GitHub8. Table 3 shows the parameters used in our experiments.

8 https://github.com/gaith7/BigTrustScheduling

Table 3: Simulation parameters

Parameters                   | Values
---------------------------- | --------------------------------------
Number of VMs                | 10, 15, 20, 25, 30, 35, 40, 45, 50
Number of tasks              | 7,884
VMs performance parameters   | CPU, RAM, bandwidth, and disk storage
Programming language         | Python
Platform used                | Hadoop and CloudSim
Compared approaches          | SJF, RR, and improved PSO
Percentage of untrusted VMs  | 10%-50%
K-means number of clusters   | 1-5
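The K-means cost clustering (with the number of clusters varied as in Table 3) can be sketched in pure Python for one-dimensional monetary costs; the per-task costs below are illustrative assumptions, not values from the Bitbrains dataset:

```python
def kmeans_1d(values, k, iters=50):
    """Lloyd's algorithm on 1-D data (k >= 2), seeded with evenly
    spaced data points to avoid empty initial clusters."""
    data = sorted(values)
    centroids = [data[i * (len(data) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in values:
            # Assign each point to its nearest centroid.
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            buckets[nearest].append(v)
        # Recompute each centroid as the mean of its bucket.
        centroids = [sum(b) / len(b) if b else centroids[j]
                     for j, b in enumerate(buckets)]
    return centroids, buckets

# Hypothetical per-task monetary costs (in $) with three visible bands.
costs = [0.10, 0.12, 0.11, 0.55, 0.52, 0.58, 1.90, 2.10, 2.00]
centroids, buckets = kmeans_1d(costs, k=3)
print(sorted(round(c, 2) for c in centroids))
```

On this toy data the three recovered centroids sit at the three cost bands; in the paper's setup, the quality of each choice of k (here 1-5) would then be compared, e.g., via the resulting percentage of outliers.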


5.2. Experimental Results

In Fig. 8, we measure the total makespan of tasks entailed by the different studied solutions, while varying the number of deployed VMs from 10 to 50. For this experiment, the number of tasks has been fixed to 7,884 and we assume that all the considered VMs are trustworthy. By observing the figure, we notice that our solution decreases the total makespan compared to the other solutions, especially in a low-density VM environment (i.e., when the number of deployed VMs is relatively small). The reason is that, in our solution, we provide a matching algorithm that intelligently maps the resource requirements of tasks to the resource capabilities of the VMs. In other words, we decrease the probability of assigning tasks that require significant amounts of resources to low-performing VMs that would not be capable of efficiently serving those tasks. The second observation from this figure is that, by increasing the number of deployed VMs, the performance gap between the different solutions tends to become smaller. This can be justified by the fact that the larger the number of VMs is (assuming that all these VMs are trustworthy), the larger the resources available to serve tasks would be. Nonetheless, by improving the performance in low-density VM environments, our solution helps cloud providers decrease their overall costs by reducing their need to deploy more VMs.


[Figure 8 plots the total makespan (in seconds) against the number of VMs (10-50) for SJF, RR, improved PSO, and the proposed algorithm.]

Figure 8: Task Makespan: Our solution significantly decreases the tasks makespan compared to the three considered solutions


In Fig. 9, we study the monetary cost entailed by the different studied solutions for the cloud providers, while varying the number of deployed VMs from 10 to 50. For this experiment, the number of tasks has been fixed to 7,884 and we assume that the considered VMs are all trustworthy. By observing the figure, we notice that our solution considerably helps providers reduce their costs compared to the other solutions. The reason for this observation is that our solution decreases the total makespan of tasks, as shown in Fig. 8.

[Figure 9 plots the monetary cost (in $) against the number of VMs (10-50) for SJF, RR, improved PSO, and the proposed algorithm.]

Figure 9: Task Costs: Our solution is able to decrease the cost of task execution in comparison with the three considered models

In Fig. 10, we measure the time spent by the tasks waiting in the queue to be assigned to VMs. Like the previous experiments, we vary the number of deployed VMs from 10 to 50, while fixing the number of tasks to 7,884 and assuming that all the considered VMs are trustworthy. The main observation that can be drawn from this figure is that increasing the number of deployed VMs leads to a considerable decrease in the waiting times. While this result is expected, the objective of this figure is to show that, even for a small number of deployed VMs and a quite large number of tasks (i.e., 7,884), the waiting time entailed by our model is still acceptable. For instance, the total waiting time for 7,884 tasks to be assigned to 10 VMs is 18 seconds.

[Figure 10 plots the waiting time (in seconds) against the number of VMs (10-50).]

Figure 10: Waiting Time: The waiting time of tasks prior to being assigned to VMs significantly decreases with the increase in the number of deployed VMs


In Fig. 11, we measure the makespan of tasks while varying the percentage of untrusted VMs from 10% to 50%. For this experiment, we fix the number of deployed VMs to 50 and the number of tasks to 7,884. We notice from this figure that, starting from 10% of untrusted VMs, our solution can considerably reduce the makespan compared to the other approaches. The reason behind this improvement is that, in our solution, we employ a two-step trust establishment mechanism and an MCMC Gibbs Sampler-based aggregation technique to derive the trust value of each VM and rank the VMs based on their degree of trustworthiness. This is important to avoid assigning tasks to untrusted VMs whose poor performance might affect the overall timespan and success chances of the task execution process. In other words, our scheduling technique maps the tasks to the VMs that enjoy high trust values (i.e., highly performing VMs) to ensure minimal makespan and cost. This means that low-trusted VMs would have quite poor chances of being assigned any tasks. On the other hand, despite the effectiveness of the other compared approaches, they ignore the trustworthiness of the VMs in the scheduling process, thus increasing the chances of tasks being assigned to poorly performing VMs.
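To illustrate the kind of Gibbs-Sampler aggregation mentioned above, the sketch below treats the two initial trust values as noisy observations of a VM's true trust under a Normal likelihood with conjugate priors, and returns the posterior mean of the sampled mean as the aggregated score. The model, priors, and numbers are assumptions chosen for illustration; the paper's exact formulation may differ:

```python
import random

def gibbs_trust(obs, mu0=0.5, tau0=0.3, a0=2.0, b0=0.02,
                iters=3000, burn=500, seed=7):
    """Gibbs sampler for the mean mu (true trust) and variance sigma^2
    (observation noise) given trust observations `obs`; returns the
    posterior mean of mu as the final trust score."""
    rng = random.Random(seed)
    n, xbar = len(obs), sum(obs) / len(obs)
    mu, sigma2 = xbar, 0.01
    samples = []
    for t in range(iters):
        # Draw mu | sigma^2, data: conjugate Normal update of N(mu0, tau0^2).
        prec = 1.0 / tau0 ** 2 + n / sigma2
        mean = (mu0 / tau0 ** 2 + n * xbar / sigma2) / prec
        mu = rng.gauss(mean, (1.0 / prec) ** 0.5)
        # Draw sigma^2 | mu, data: Inverse-Gamma(a0 + n/2, b0 + SS/2),
        # sampled as the reciprocal of a Gamma(shape, scale = 1/rate) draw.
        ss = sum((x - mu) ** 2 for x in obs)
        sigma2 = 1.0 / rng.gammavariate(a0 + n / 2.0, 1.0 / (b0 + ss / 2.0))
        if t >= burn:
            samples.append(mu)
    return sum(samples) / len(samples)

# First and second initial trust values of one VM (hypothetical numbers).
final_trust = gibbs_trust([0.82, 0.74])
print(round(final_trust, 2))  # near the observation mean, shrunk toward the prior
```

The aggregated score lies between the prior belief and the observation mean, which is the qualitative behavior one wants when fusing two noisy trust estimates.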

[Figure 11 plots the makespan (in seconds) against the percentage of untrusted VMs (10%-50%) for SJF, RR, improved PSO, and the proposed algorithm.]

Figure 11: Task Makespan: Our solution significantly decreases the tasks makespan compared to the three considered solutions in the presence of untrusted VMs

In Fig. 12, we quantify the monetary cost of tasks on the providers' side, while varying the percentage of untrusted VMs from 10% to 50%. For this experiment, like the previous ones, we fix the number of deployed VMs to 50 and the number of tasks to 7,884. The figure reveals that our solution aids providers in reducing their monetary costs compared to the other approaches in the presence of untrusted VMs. This can be justified by the fact that, according to Fig. 11, our trust-based approach decreases the makespan of the task scheduling and execution processes in the presence of untrusted VMs. This, in turn, reduces the amount of resources spent on serving the tasks, which leads to decreasing the monetary costs of the providers.

[Figure 12 plots the monetary cost (in $) against the percentage of untrusted VMs (10%-50%) for SJF, RR, improved PSO, and the proposed algorithm.]

Figure 12: Task Cost: Our solution significantly decreases the costs of tasks compared to the three considered solutions in the presence of untrusted VMs

In Fig. 13, we study the runtime of the four compared solutions, while varying the number of VMs from 10 to 50. For this experiment, we fix the number of tasks to 7,884. Execution time refers to the time needed by the whole set of algorithms that make up the solution to execute. According to the figure, our approach enjoys a lower runtime than the other solutions. The reason is that, although our solution consists of many phases (i.e., initial trust computation, VM monitoring, MCMC-based trust aggregation, K-means-based task cost clustering, percentile-based task requirement clustering, multi-criteria task priority determination, and trust-based task scheduling), adopting a trust-based scheduling approach reduces the chances of tasks failing on some untrusted VMs and of their rescheduling. Moreover, the VMs' trust-based clustering and the task priority determination help our solution improve the mapping between tasks' resource requirements and VMs' resource availabilities. On the other hand, the SJF solution, for example, focuses only on the size of tasks to determine the order of their execution. This has the limitation of increasing the probability of some tasks being allocated to untrusted VMs, thus raising the need for rescheduling those tasks in cases of VM failures. A careful observation of Fig. 13 also reveals that the improved PSO exhibits the highest runtime amongst the compared solutions. The reason is that the improved PSO needs to go through many iterations in order to converge to the desired solution that minimizes the scheduling process makespan.

[Figure 13 plots the algorithm execution time (in milliseconds) against the number of VMs (10-50) for SJF, RR, improved PSO, and the proposed algorithm.]

Figure 13: Execution Time: The execution time of our solution is lower than that of the three other compared approaches

keep in Fig. 14 both trusted and untrusted VMs, and study the impact of having some

46

Journal Pre-proof

untrusted VMs on the performance of big data scheduling. To do so, we provide an in-depth breakdown of our scheduling solution by showing, for each VM, the number of assigned tasks and the makespan of the process of serving these tasks, under a vary-

pro of

700

ing percentage of untrusted VMs. In this figure, the red plots refer to the untrusted VMs, while the blue plots refer to the trusted ones. The main observation that can be drawn from this figure is that our solution assigns only a small number of tasks to the untrusted VMs compared to the trusted ones (for example, in Fig. 14a, the highest 705

number of tasks that were assigned to an untrusted VM was 10 out of 7, 884). The second important observation is that the trusted VMs (in blue) entail small makespan even when assigned a large number of tasks, compared to the untrusted ones that are

re-

assigned a negligible number of tasks. For example, we can notice from Fig. 14a that, among the trusted VMs, the makespan of the VM that was assigned the largest number 710

of tasks (≈ 490 tasks) has been ≈ 22s. On the other hand, among the untrusted VMs, the makespan of the VM that was assigned the largest number of tasks (≈ 15 tasks)

has been ≈ 35s. This shows that even when assigned a considerably smaller number of

Jo

urn a

larger pools of tasks.

lP

tasks, untrusted VMs entail higher makespans compared to the trusted VMs that serve

47

50

50

40

40

Makespan(sec)

30 20 10

30 20 10

pro of

Makespan(sec)

Journal Pre-proof

0 600

0 400

300

50

400

40 20

Number of tasks

0

30

20

100

10

0

50

40

200

30

200

Number of tasks

VM ID

(a) 10% of untrusted VMs

0

10

0

VM ID

(b) 20% of untrusted VMs

60

80

Makespan(sec)

Makespan(sec)

50

60

40

20

40 30 20 10

0 600

0 400 300

50 30 0

10 0

40

30

200

20

100

50

400

40

200

Number of tasks

Number of tasks

VM ID

10

0

VM ID

(d) 40% of untrusted VMs

50

Makespan(sec)

40 30 20 10

0 600

re-

(c) 30% of untrusted VMs

20

0

50

400

40

30

200

20

Number of tasks

0

10

0

VM ID

lP

(e) 50% of untrusted VMs

Figure 14: Task Makespan: We give in this figure a detailed breakdown of the task scheduling process per VM and the resulting makespan, while varying the percentage of untrusted VMs. Note that the red stems represent the untrusted VMs and the blue stems represent the trusted ones.

In Fig. 15, we study the total makespan while varying both the number of tasks

urn a

715

and number of deployed VMs. We notice the total makespan keeps increasing with the increase in the number of tasks. The reason is that the more the number of tasks that need to be served is, the more the time it would take to serve them, for a fixed number of VMs. However, by increasing the number of deployed VMs, we can reduce 720

the makespan needed to serve an increasing number of tasks. Under this scenario, our solution shows the ability to reduce the makespan compared to the other solutions, while showing a greater scalability to the increase in both the number of tasks and

Jo

number of VMs. This is particularly useful in production environments that require serving huge numbers of tasks.

48

Journal Pre-proof

4000 3500

900 SJF RR Improved PSO Proposed Alg.

SJF RR Improved PSO Proposed Alg.

800 700

3000

Makespan(sec)

2000

500

pro of

Makespan(sec)

600 2500

1500

400 300

1000

200

500

100

0

0

1000

3000

5000

7884

1000

Number of Tasks

(a) Number of VMs = 10

Makespan(sec)

400

300

200

0 5000

7884

Number of Tasks

(c) Number of VMs = 50

re-

100

3000

7884

(b) Number of VMs = 35

SJF RR Improved PSO Proposed Alg.

1000

5000

Number of Tasks

600

500

3000

725

6. Conclusion

lP

Figure 15: Task Makespan: We study in this figure the impact of varying both the number of tasks and number of VMs on the overall makespan of tasks

In this paper, we proposed BigTrustScheduling, our trust based scheduling approach that is particularly useful for big data tasks. The main idea is to derive a trust value for

urn a

each VM baseda on its underlying performance, and then prioritize tasks based on their resource requirements and associate prices. The proposed scheduling solution 730

then intelligently maps the tasks to the appropriate VMs in such a way that minimizes the makespan and the cost of tasks processing. Experiments conducted on a real Hadoop cluster environment using real-world datasets collected from the Google Cloud Platform pricing and Bitbrains task and resource requirements reveal that our solution minimizes the makespan up to 59% compared to the Shortest Job First (SJF), up to 48% compared to the Round Robin (RR),

Jo

735

and up to 40% compared to the improved Particle Swarm Optimization (PSO) approaches, under various percentages of untrusted VMs. Besides, our solution reduces the monetary cost by 58% compared to the SJF, by 47% compared to the RR, and

49

Journal Pre-proof

by 38% compared to the improved PSO under various percentages of untrusted VMs. 740

Furthermore, our solution significantly reduces the run time compared to the studied

pro of

solutions. The results presented in this paper can be applicable to other problems. This would be possible through tuning the corresponding metrics in the formulation of the problem and solution, as will as in the experimental environment. In fact, the proposed trust model can be extended to other environments including cloud computing, IoT, 745

parallel computing, etc.

In the future, we plan to extend this work by investigating a deep learning-based scheduling approach which helps automate the scheduling process, especially in dynamic environments such as communities of cloud services [34, 35] wherein tasks are

750

re-

expected to come at a high pace. Bibliography

[1] I. A. T. Hashem, I. Yaqoob, N. B. Anuar, S. Mokhtar, A. Gani, S. U. Khan, The rise of “big data” on cloud computing: Review and open research issues,

lP

Information systems 47 (2015) 98–115.

[2] C.-T. Yang, J.-C. Liu, C.-H. Hsu, W.-L. Chou, On improvement of cloud virtual 755

machine availability with virtualization fault tolerance mechanism, The Journal of Supercomputing 69 (3) (2014) 1103–1122.

urn a

[3] B. Ratner, Statistical and Machine-Learning Data Mining:: Techniques for Better Predictive Modeling and Analysis of Big Data, Chapman and Hall/CRC, 2017. [4] N. Moustafa, G. Creech, J. Slay, Big data analytics for intrusion detection sys760

tem: Statistical decision-making using finite dirichlet mixture models, in: Data analytics and decision support for cybersecurity, Springer, 2017, pp. 127–156. [5] D. Van Ravenzwaaij, P. Cassey, S. D. Brown, A simple introduction to markov

Jo

chain monte–carlo sampling, Psychonomic bulletin & review 25 (1) (2018) 143– 154.

765

[6] X. Jin, J. Han, K-means clustering, Encyclopedia of Machine Learning and Data Mining (2017) 695–697. 50

Journal Pre-proof

[7] D. Cheng, X. Zhou, P. Lama, J. Wu, C. Jiang, Cross-platform resource scheduling for spark and mapreduce on yarn, IEEE Transactions on Computers 66 (8) (2017)

770

pro of

1341–1353. [8] S. L´opez-Huguet, A. P´erez, A. Calatrava, C. de Alfonso, M. Caballer, G. Molt´o, I. Blanquer, A self-managed mesos cluster for data analytics with qos guarantees, Future Generation Computer Systems 96 (2019) 449–461.

[9] A. S. Sofia, P. GaneshKumar, Multi-objective task scheduling to minimize energy consumption and makespan of cloud computing using nsga-ii, Journal of Network 775

and Systems Management 26 (2) (2018) 463–485.

re-

[10] N. Mansouri, B. M. H. Zade, M. M. Javidi, Hybrid task scheduling strategy for cloud computing by modified particle swarm optimization and fuzzy theory, Computers & Industrial Engineering 130 (2019) 597–633. [11] B. Gomathi, K. Krishnasamy, B. S. Balaji, Epsilon-fuzzy dominance sort-based composite discrete artificial bee colony optimisation for multi-objective cloud

lP

780

task scheduling problem, International Journal of Business Intelligence and Data Mining 13 (1-3) (2018) 247–266.

[12] Z. Zhou, F. Li, H. Zhu, H. Xie, J. H. Abawajy, M. U. Chowdhury, An improved

785

urn a

genetic algorithm using greedy strategy toward task scheduling optimization in cloud environments, Neural Computing and Applications (2019) 1–11. [13] G. Rjoub, J. Bentahar, Cloud task scheduling based on swarm intelligence and machine learning, in: 2017 IEEE 5th International Conference on Future Internet of Things and Cloud (FiCloud), IEEE, 2017, pp. 272–279. [14] D. Grzonka, A. Jakobik, J. Kołodziej, S. Pllana, Using a multi-agent system and artificial intelligence for monitoring and improving the cloud performance and security, Future Generation Computer Systems 86 (2018) 1106–1117.

Jo

790

[15] F. Luo, Y. Yuan, W. Ding, H. Lu, An improved particle swarm optimization algorithm based on adaptive weight for task scheduling in cloud computing, in:

51

Journal Pre-proof

Proceedings of the 2nd International Conference on Computer Science and Ap795

plication Engineering, ACM, 2018, p. 142.

pro of

[16] G. Zhong-wen, Z. Kai, The research on cloud computing resource scheduling method based on time-cost-trust model, in: Computer Science and Network Technology (ICCSNT), 2012 2nd International Conference on, IEEE, 2012, pp. 939– 942. 800

[17] W. Wang, G. Zeng, D. Tang, J. Yao, Cloud-dls: Dynamic trusted scheduling for cloud computing, Expert Systems with Applications 39 (3) (2012) 2321–2329. [18] Y. Yang, X. Peng, Trust-based scheduling strategy for workflow applications in

re-

cloud environment, in: P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), 2013 Eighth International Conference on, IEEE, 2013, pp. 316–320. 805

[19] Y. Cheng, Z. Wu, K. Liu, Q. Wu, Y. Wang, Smart dag tasks scheduling between 1826.

lP

trusted and untrusted entities using the mcts method, Sustainability 11 (7) (2019)

[20] J. A. J. Sujana, M. Geethanjali, R. V. Raj, T. Revathi, Trust model based scheduling of stochastic workflows in cloud and fog computing, in: Cloud Computing 810

for Geospatial Big Data Analytics, Springer, 2019, pp. 29–54.

urn a

[21] M. A. Rodriguez, R. Buyya, A taxonomy and survey on scheduling algorithms for scientific workflows in iaas cloud computing environments, Concurrency and Computation: Practice and Experience 29 (8) (2017) e4041. [22] A. Arunarani, D. Manjula, V. Sugumaran, Task scheduling techniques in cloud 815

computing: A literature survey, Future Generation Computer Systems 91 (2019) 407–415.

Jo

[23] S. Basu, M. Karuppiah, K. Selvakumar, K.-C. Li, S. H. Islam, M. M. Hassan, M. Z. A. Bhuiyan, An intelligent/cognitive model of task scheduling for iot applications in cloud computing environment, Future Generation Computer Systems

820

88 (2018) 254–261.

52

Journal Pre-proof

[24] S. Abirami, S. Ramanathan, Linear scheduling strategy for resource allocation in cloud environment, International Journal on Cloud Computing: Services and Architecture (IJCCSA) 2 (1) (2012) 9–17.

[25] B. Saovapakhiran, G. Michailidis, M. Devetsikiotis, Aggregated-DAG scheduling for job flow maximization in heterogeneous cloud computing, in: Proceedings of the Global Communications Conference, GLOBECOM, Houston, Texas, USA, 2011, pp. 1–6.

[26] S. T. Maguluri, R. Srikant, L. Ying, Stochastic models of load balancing and scheduling in cloud computing clusters, in: INFOCOM, 2012 Proceedings IEEE, IEEE, 2012, pp. 702–710.

[27] P. Zhang, M. Zhou, Dynamic cloud task scheduling based on a two-stage strategy, IEEE Transactions on Automation Science and Engineering 15 (2) (2018) 772–783.

[28] O. A. Wahab, J. Bentahar, H. Otrok, A. Mourad, Optimal load distribution for the detection of VM-based DDoS attacks in the cloud, IEEE Transactions on Services Computing (2017), DOI: 10.1109/TSC.2017.2694426.

[29] G. E. Box, G. C. Tiao, Bayesian inference in statistical analysis, Vol. 40, John Wiley & Sons, 2011.

[30] S. Geisser, Predictive inference, Routledge, 2017.

[31] D. Yi, X. Li, J. S. Liu, Bayesian aggregation of rank data with covariates and heterogeneous rankers, arXiv preprint arXiv:1607.06051.

[32] R. J. Hyndman, Y. Fan, Sample quantiles in statistical packages, The American Statistician 50 (4) (1996) 361–365.

[33] R. R. Yager, Categorization in multi-criteria decision making, Information Sciences 460 (2018) 416–423.

[34] B. Khosravifar, J. Bentahar, A. Moazin, P. Thiran, Analyzing communities of web services using incentives, Int. J. Web Service Res. 7 (3) (2010) 30–51.


[35] J. Bentahar, Z. Maamar, D. Benslimane, P. Thiran, Using argumentative agents to manage communities of web services, in: 21st International Conference on Advanced Information Networking and Applications (AINA), Workshops Proceedings, Volume 2, May 21-23, Niagara Falls, Canada, 2007, pp. 588–593.


Biographies


Gaith Rjoub received his M.Eng. degree in Quality Systems Engineering from the Concordia Institute for Information Systems Engineering (CIISE), Canada, in 2014. He is currently pursuing his Ph.D. degree in Information Systems Engineering at Concordia University. His research interests include cloud computing, machine learning, artificial intelligence, and big data analytics.


Jamal Bentahar received the bachelor’s degree in Software Engineering from the National Institute of Statistics and Applied Economics, Morocco, in 1998, the M.Sc. degree in the same discipline from Mohamed V University, Morocco, in 2001, and the Ph.D. degree in Computer Science and Software Engineering from Laval University, Canada, in 2005. He is a Professor with the Concordia Institute for Information Systems Engineering, Faculty of Engineering and Computer Science, Concordia University, Canada. From 2005 to 2006, he was a Postdoctoral Fellow with Laval University, and then an NSERC Postdoctoral Fellow at Simon Fraser University, Canada. He was an NSERC Co-Chair for Discovery Grants in Computer Science (2016–2018). His research interests include cloud and services computing, trust and reputation, game theory, computational logics, model checking, multi-agent systems, and software engineering.


Omar Abdel Wahab is an Assistant Professor at the Department of Computer Science and Engineering, Université du Québec en Outaouais, Canada. He holds a Ph.D. in Information and Systems Engineering from Concordia University, Montreal, Canada, and received his M.Sc. in Computer Science in 2013 from the Lebanese American University (LAU), Lebanon. From 2017 to 2018, he was a postdoctoral fellow at the École de Technologie Supérieure (ÉTS), Canada, where he worked on an industrial research project in collaboration with Rogers and Ericsson. His current research activities are in the areas of artificial intelligence, cybersecurity, cloud computing, and big data analytics. He is the recipient of several prestigious awards, including the Quebec Merit Scholarship (FRQNT). Moreover, he is a TPC member of several prestigious conferences and a reviewer for several highly ranked journals.




Highlights

• We propose BigTrustScheduling, a trust-aware scheduling solution for cloud computing.

• Our solution considers VMs’ trust level computation and task priority determination.

• Experiments are conducted using real datasets collected from Google and Bitbrains.

• Our solution outperforms the state of the art in minimizing the makespan and the cost.


Declaration of interests

☒ The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

☐ The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: