Prediction Mechanisms for Monitoring State of Cloud Resources Using Markov Chain Model
Mustafa M. Al-Sayed, Sherif Khattab, and Fatma A. Omara
Abstract—Cloud computing allows for sharing computing resources, such as CPU, application platforms, and services. Monitoring these resources would benefit from an accurate prediction model that significantly reduces the network overhead caused by unnecessary push and pull messages. However, accurate prediction of the computing resources is considered hard due to the dynamic nature of cloud computing. In this paper, two monitoring mechanisms have been developed: the first is based on a Continuous Time Markov Chain (CTMC) model and the second is based on a Discrete Time Markov Chain (DTMC) model. It is found that the CTMC-based mechanism outperformed the DTMC-based mechanism. Also, in monitoring cloud resources, the CTMC-based mechanism outperformed the Grid Resource Information Retrieval (GRIR) mechanism, which does not employ prediction, and a prediction-based mechanism that uses Markov Chains to predict the time interval of monitoring mobile grid resources.
Keywords: Markov Chains, Cloud Computing, Resource Monitoring
M. M. Al-Sayed is with the Department of Computer Science, Faculty of Computers and Information, Minia University, Egypt (e-mail: [email protected]; mobile: +201120136476).
S. Khattab is with the Department of Computer Science, Faculty of Computers and Information, Cairo University, Egypt (e-mail: [email protected]).
F. A. Omara is with the Department of Computer Science, Faculty of Computers and Information, Cairo University, Egypt (e-mail: [email protected]).
1. INTRODUCTION
Recently, distributed computing systems have enabled a high level of resource sharing. Types of these systems include clusters, grids, and cloud computing, which allows users to access large amounts of computing power in a fully virtualized manner, through resource pooling and a single system view. These systems deliver resources as a utility. Cloud computing provides services to users in a "pay-as-you-go" manner with automatic elasticity. Therefore, accurate monitoring of the resources consumed by users has a great effect on accounting, billing, and system performance [1] [2] [3]. Monitoring resources in cloud environments is considered the key tool for controlling and managing hardware and software infrastructures. Monitoring provides information and Key Performance Indicators (KPIs) for both platforms and applications. Also, SLA violations can be detected by continuous monitoring. Therefore, monitoring resources in cloud environments is an important task for both cloud Providers and cloud Consumers [4]. The monitoring process is used to collect resource information (e.g., CPU load, memory usage, network bandwidth), which helps with job scheduling, load balancing, event prediction, fault detection, and recovery tasks. Because inaccurate information would affect the performance of these tasks, the monitoring process has to ensure a consistent state. This monitoring scheme needs to adapt to the dynamic nature of cloud computing that stems from
dynamic user and system loads, and elastic resource provisioning [5] [6]. A monitored resource (termed the Producer in this paper) sends its monitored state to one or more master nodes (termed the Consumer in this paper) over the network. The Producer may proactively send its state to the Consumer or wait until the Consumer requests a state update. The former case follows a push model, whereas the latter is a pull model. Monitoring cloud resources would benefit from an accurate prediction model that significantly reduces the network overhead caused by unnecessary push and pull network messages. For instance, using periodic monitoring updates with a fixed interval may cause the monitored resource to push (send) unnecessary network messages containing the same information. Fixed-period monitoring may alternatively cause the master node to send unnecessary pulls (requests). However, if the Consumer correctly predicts a monitored value, up to a predefined error tolerance degree (ETD), the Producer would not need to send this value. Still, accurate prediction is hard due to the dynamic nature of system load and resource provisioning, among other factors. In this paper, we have developed two resource monitoring mechanisms for cloud computing that are based on Markov Chains. The first mechanism uses Continuous-Time Markov Chains (CTMC), whereas the second uses Discrete-Time Markov Chains (DTMC). Both mechanisms push resource state updates when the difference between the predicted and actual values exceeds the ETD. These updates are used to tune the prediction model at the Consumer. In both mechanisms, a K-state Markov model is developed based on training data. The dataset used to evaluate our mechanisms was released by Google; it contains 29 days of measurement data collected from a computing Cluster in May 2011 [7]. Half
of these data was used for training. Furthermore, we have compared the accuracy and the network overhead of our mechanisms with the push-based Grid Resource Information Retrieval (GRIR) mechanism [8] and a prediction-based mechanism [6], which uses a Markov Chain Model (MCM) to predict the time interval for monitoring mobile resources (for brevity, we call it the MTI mechanism), after deploying them to monitor cloud resources. The implementation results showed that the CTMC-based mechanism achieved better performance than the DTMC-based mechanism. Also, the CTMC-based mechanism achieved better performance than the GRIR and the MTI mechanisms. This paper is organized as follows: background about Markov chains and related works are presented in Sections 2 and 3, respectively; the proposed mechanisms are presented in Section 4; Section 5 describes the evaluation of our proposed mechanisms; finally, the conclusion and future work are presented in Section 6.

2. MARKOV CHAINS
The artificial intelligence community has adopted stochastic technologies for machine learning. Bayes' rule is the basis for these technologies, where Bayesian approaches can predict future events from prior events. The Markov Chain Model (MCM) is one of these technologies. In an MCM, the probability of an event occurring at any time is a function of the probabilities of the events that occurred at previous time periods [9]. Markov chains may be discrete-time or continuous-time. In the following subsections, we give a brief introduction to the key property of the Discrete-Time Markov Chain (DTMC) and the Continuous-Time Markov Chain (CTMC).
2.1 Discrete-Time Markov Chain (DTMC)
The Markov property in DTMCs implies that [10]:

$P(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0) = P(X_{n+1} = j \mid X_n = i) = P_{ij}$,

where $P_{ij}$ is the one-step transition probability from state i to state j at a specific interval. These transition probabilities can be expressed in a K×K transition matrix, where K is the number of states [11]. Assuming a transition matrix T with an initial probability distribution vector π, the probability that the chain is in state i after n steps is the i-th entry in the vector $\pi_n = \pi T^n$ [12]. Although $T^n$ can be computed with O(log n) matrix multiplications instead of O(n), such a performance gain would be tangible only for large values of n. The value of n in our work was small (~40), which did not warrant the use of optimized multiplication.

2.2 Continuous-Time Markov Chain (CTMC)
In a CTMC, state transitions can happen at any point in time [13]. CTMCs move from one state to another according to a DTMC model, and the time spent in each state is exponentially distributed with parameter λ. The CTMC model is described by a transition matrix T = [$P_{ij}$], as in the DTMC model, together with a set of time rates {$\lambda_{ij}$ : i, j ∈ S}, where S is the state space, expressed in a transition rate matrix R = [$\lambda_{ij}$]. Every time state i is visited, the chain spends there, on average, $E(T_i) = 1/\lambda_i$ time units before moving on [14] [11].

3. RELATED WORKS
Existing monitoring systems are based on three basic models: Push, Pull, and Hybrid. In the Push model, Producers push updated information to the Consumer under some trigger conditions based on a Service Level Agreement (SLA). In the Pull model, the Consumer requests resource information from the Producers when it
needs. The Hybrid model allows switching between the Push and Pull models according to a specific threshold [8] [15]. The monitoring systems of distributed systems can also be classified into three types based on these models: (1) Push-based monitoring systems, such as the Nagios Service Check Adaptor (NSCA) [16], PCMONS [17], and Lattice [18]; (2) Pull-based monitoring systems, such as the Nagios Remote Plugin Executor (NRPE) [19] and Hyperic HQ [20]; and (3) Hybrid systems, such as Ganglia [21] and MonALISA [22]. According to [23], monitoring mechanisms based on the Push model are considered the most suitable for large distributed systems, where update messages are pushed to the Consumer based on the Producer's state. This spares the network and the Consumer from useless messages. On the other hand, in the Pull model, each pull operation is answered by a corresponding push operation, which doubles the consumed network bandwidth and the load on the Consumer. Hence, our proposed monitoring mechanisms are based on the Push model, where the Producer pushes its updates to the Consumer when the difference between the actual measurement and the state value predicted using a Markov Chain Model (MCM) exceeds a certain Error Tolerance Degree (ETD) limit, which is defined according to the users' requirements. In an attempt to minimize unnecessary and useless update messages and at the same time maximize the consistency between the Producer and the Consumer, Chung and Chang [8] have proposed a resource monitoring mechanism for Grid computing called Grid Resource Information Retrieval (GRIR). GRIR improves on the Push model. The authors examined a set of data delivery protocols: the Offset-Sensitive Mechanism (OSM), the Time-Sensitive Mechanism (TSM), and the hybrid Announcing with Change and Time Consideration (ACTC). They found that the
hybrid protocol outperforms the other protocols, because it tunes the updating time interval and the threshold dynamically when any change occurs. A comparative study has been conducted between GRIR (based on the ACTC protocol) and the monitoring mechanism of the Ganglia system. According to the comparative results, the GRIR mechanism achieved high quality with a small degree of communication overhead [23]. So, the GRIR mechanism will be considered as a benchmark to evaluate the performance of our proposed mechanisms. MCM has been widely used to model sequential events, such as natural languages, human speech, and animal behavior. Also, the Markovian property (i.e., the conditional distribution of the next state depends only on the current state) satisfies the memoryless property, which is beneficial during continuous monitoring. On the other hand, it has been used, to a limited extent, to monitor resources in large distributed environments in order to overcome the high level of network overhead caused by pull-based systems [6] [13] [14]. Examples of these limited attempts, which are the basis of the work in this paper, are discussed in the following paragraphs. Assigning computing processes to mobile devices suffers from some obstacles, such as the instability of wireless communication, limited power supply, and low communication bandwidth. To ensure the availability of these devices, a continuous monitoring mechanism is required. On the other hand, if the monitoring time interval is very short, the communication overhead will increase. In an attempt to adjust the monitoring time interval dynamically according to the state of the monitored mobile devices, Park et al. [6] have proposed a monitoring mechanism using MCM to predict the monitoring time interval (MTI). Based on the predicted state
and the job processing time, the monitoring time intervals are adjusted. Job processing time is one of the main factors of MTI. The MTI mechanism includes the following steps: (1) through a fixed time interval (e.g., an hour, a day, a week, or a month), the previous state information is collected; (2) the probability distribution vector of the initial state is derived from the collected information; (3) the probability distribution vector of the next state j is predicted from the probability distribution of the current state i; and (4) the monitoring time interval is calculated from the job processing time and the usage limitation rate of the monitored resource. Calculating the monitoring time interval in the MTI mechanism differs based on the predicted state. MTI is based on an MCM prediction model with only three states of resource usage: a stable state (70% at most), an unstable state (greater than 70% and less than or equal to 90%), and a disable state (greater than 90%); a small sketch of this classification is given at the end of this section. So, it is difficult to use this mechanism in a cloud environment, where requests to allocate resources change rapidly. In an attempt to study demand-response home energy management and distribution network simulation of smart buildings, Haider et al. [11] have developed energy consumption MCMs for smart buildings. These models are based on CTMCs and DTMCs for different periods of the day. However, this mechanism is restricted to the electrical loads caused by the prospective widespread adoption of smart buildings. In this paper, we study the effect of applying this scheme to the performance of monitoring cloud resources. Accordingly, in this work the MCM has been used to develop a prediction mechanism for monitoring resources in the cloud environment.
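The following fragment is the sketch referred to above. It only illustrates the three-state classification of the MTI mechanism [6] using the thresholds listed above; the type and function names are ours and are not part of the original work.

```cpp
// Illustrative sketch of the three usage states of the MTI mechanism [6].
// Thresholds follow the description above; the names (UsageState,
// classifyUsage) are illustrative assumptions, not the original code.
enum class UsageState { Stable, Unstable, Disable };

UsageState classifyUsage(double usagePercent) {
    if (usagePercent <= 70.0) return UsageState::Stable;    // at most 70%
    if (usagePercent <= 90.0) return UsageState::Unstable;  // (70%, 90%]
    return UsageState::Disable;                             // above 90%
}
```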
4. MONITORING MECHANISMS BASED ON MARKOV CHAIN
In this paper, we have developed two monitoring mechanisms to predict the monitored states of cloud resources: a CTMC-based and a DTMC-based mechanism. Both mechanisms push updates to the Consumer when the difference between the predicted and the actual values exceeds the ETD value. These updates are used to tune the prediction model at the Consumer. In both mechanisms, a K-state Markov model is developed based on training data. The construction of our prediction models and the algorithmic steps of both proposed mechanisms are described in detail in the following subsections.

4.1 Prediction Models Construction
The construction of our K-state prediction models is based on training data and proceeds as follows: (1) As an initial step, K centroids are generated from the given training data using a clustering algorithm. In our experiments, we used the K-means algorithm due to its speed; other clustering algorithms with better accuracy are more expensive, which is unsuitable for the
dynamic nature of cloud computing. (2) The centroids are used as quantization levels for the training data [24]. (3) The quantized dataset is used for deriving the transition rate and the transition probability matrices of our prediction models, as mentioned in Section 2.

4.2 The Algorithmic Steps of the Proposed Mechanisms
The proposed mechanisms are composed of two algorithms: one runs at the Producers and the other at the Consumer. The two algorithms run simultaneously; their implementation is discussed below. According to the flowchart in Figure 1, the algorithmic steps of our proposed mechanisms differ between the Producers and the Consumer. The Producer algorithm pushes an update to the Consumer when the difference between the current measured value and the predicted value exceeds a predefined ETD value. The predicted value is computed as follows: (1) the value of the last pushed update is quantized using the K-means clustering algorithm to determine its level (among the K
Figure 1: The Algorithmic Steps of Our Proposed Mechanisms.
levels); this level is used as a start state for our prediction models. (2) These models are used to predict the next monitored state value from the last one. The Consumer algorithm, after the expiry of the prediction time period, checks whether there is an update from the Producer. When there is an update, the value of this update is associated with the current time unit, and the quantized level of this update is used as a start state in the next time units until the reception of a new update or the next expiration of the prediction timer. If there is no update, the current time unit is associated with a value predicted using our prediction models. The implementation of both algorithms includes two phases, described in the following subsections.

Initial Phase
As shown in Figure 1, the initial phase has the following steps: (1) The information pertaining to the history of the monitored resources (training data) is collected during a fixed period (e.g., a week, two weeks, a month, or a year). (2) Based on the collected training data from Step 1 and the number of states K, which are considered inputs to this phase, K centers are generated using the K-means clustering algorithm. (3) Based on the generated K centers from Step 2, the training data are quantized using the K-means clustering algorithm in order to produce a quantized dataset. (4) Based on the quantized dataset from Step 3, the transition matrix T and the rate matrix R are constructed. The matrix T used in the DTMC-based mechanism includes transition probabilities from state i to state j (including i = j), whereas the matrix T used in the CTMC-based mechanism
includes transition probabilities from state i to state j (excluding i = j). (5) The ETD value is defined according to the user requirements and the status changes of the monitored resources. Also, the start state is defined in this phase.

Second Phase at the Producer Side
As stated before, the implementation of the second phase at the Producer differs from that at the Consumer in a few steps. The implementation of this phase at the Producer consists of the following steps (a sketch of this loop is given at the end of Section 4.2): (1) Every unit of time, the monitored state value P is predicted according to the predefined matrices (T and R) and the last measured value. (2) The current actual measured value C is compared with P from Step 1. When the difference between C and P is smaller than the predefined ETD threshold, no update is sent. On the other hand, when the difference between C and P is greater than the ETD value, (a) the C value is quantized and its state is assigned as the start state; (b) the initial probability distribution vector is reinitialized with the new start state; and (c) finally, the C value is pushed to the Consumer node in order to adjust its prediction path. (3) The Producer waits a unit of time and then repeats Step 1.

Second Phase at the Consumer Side
Presenting information about the monitored resources with the highest possible accuracy is the main goal of the second phase at the Consumer. So, the Consumer predicts this information continuously as long as there are no updates from the Producer; otherwise, the Consumer adjusts its prediction path when updates arrive. The implementation of this
phase at the Consumer consists of the following steps: (1) Every unit of time, the state value P is predicted according to the predefined matrices (T and R) and the last measured value. (2) After the end of the predefined unit of time, the Consumer checks for updates from the Producer. If there is no update from the Producer, the P value is associated with the current unit of time, and the Consumer waits a unit of time and repeats Step 1. On the other hand, if there is an update from the Producer, sub-steps (a) and (b) of the second phase at the Producer side are executed, and finally the value of the received update C is associated with the current unit of time. (3) The Consumer waits a time unit and repeats Step 1.
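To make the Producer-side loop concrete, the following C++ sketch outlines one prediction-and-push iteration (shown for the DTMC variant; the CTMC variant additionally consults the rate matrix R). It is a minimal illustration, not the code used in our experiments: the helper names (quantize, predictNext, pushToConsumer) and the choice of the most probable next state as the prediction rule are assumptions made for this sketch.

```cpp
// Illustrative Producer-side step (not the actual implementation).
// 'centroids' are the K quantization levels, 'T' is the K x K transition
// matrix, and pushToConsumer() sends an update (correction) message.
#include <cmath>
#include <cstddef>
#include <vector>

std::size_t quantize(double value, const std::vector<double>& centroids) {
    std::size_t best = 0;
    for (std::size_t k = 1; k < centroids.size(); ++k)
        if (std::fabs(value - centroids[k]) < std::fabs(value - centroids[best]))
            best = k;                                  // nearest quantization level
    return best;
}

double predictNext(std::size_t state,
                   const std::vector<std::vector<double>>& T,
                   const std::vector<double>& centroids) {
    std::size_t best = 0;
    for (std::size_t j = 1; j < T[state].size(); ++j)
        if (T[state][j] > T[state][best]) best = j;    // most probable next state
    return centroids[best];                            // predicted monitored value
}

void producerStep(double measuredC, double etd, std::size_t& startState,
                  const std::vector<std::vector<double>>& T,
                  const std::vector<double>& centroids,
                  void (*pushToConsumer)(double)) {
    double predictedP = predictNext(startState, T, centroids);
    if (std::fabs(measuredC - predictedP) > etd) {
        startState = quantize(measuredC, centroids);   // re-anchor the chain
        pushToConsumer(measuredC);                     // correct the Consumer's path
    }
    // Otherwise no message is sent; the Consumer keeps predicting on its own.
}
```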
5. PERFORMANCE EVALUATION
The implementations of our mechanisms are introduced in the following subsections.
Dataset: To evaluate our mechanisms, we have considered a dataset released by Google in May 2011. This dataset represents 29 days of status information about a cluster of about 11k physical machines operated as a single unit. It contains a realistic mixture of workloads, as it was collected from a Cluster of non-homogeneous machines. This Cluster was composed of three different platforms and a variety of memory/compute ratios (for details see [25]). The platforms are: Type A includes 126 machines, Type B includes about 10K machines, and Type C includes 795 machines with the top configuration. We focused on the Type C platform, as the CPU and memory size measurements (the exact numbers of CPU cores and bytes of memory) are normalized to the configuration of the largest machines. In our experiments, we selected two machines from Type C with IDs "4802252539" and "4802096983" [7] [25].
The dataset contains the percentages of resources used by each task and the requests to allocate these resources. We focused only on the data that belong to resource usage, which include measurements of CPU usage, memory space usage, and some other measurements. For simplicity, we focused on the CPU and memory measurements. These measurements are stored in a table called "resource usage," which contains a measurement record every 300 s for each task [26]. Our experiments were conducted on the total CPU and memory usage of every machine, so the average CPU and memory usages of all tasks on every machine at every timestamp were added together to get these totals. As a result, two files were obtained (one file per machine). Each file contained almost 8300 records of CPU and memory usage measurements and their timestamps (one measurement every five minutes for 29 days). The first two weeks of these measurements were used as training data for our prediction models, while the last two weeks were used to evaluate our proposed mechanisms.

5.1 The Implementation Environment
In our experimental environment, two PCs were used to build a private cloud computing experimental platform, with one PC as a Consumer and the other as a Producer. The Consumer and the Producer PCs were connected using a 10/100 Mbps desktop switch. Each PC ran Ubuntu 11.10 with Linux kernel 3.0 and the open-source OpenNebula [27] (an open-source project developing the industry-standard solution for building and managing virtualized enterprise data centers and IaaS clouds), which was used as the cloud computing platform. The Consumer PC was equipped with a Genuine Intel(R) Core(TM) 2 Duo CPU at 2.0 GHz with 2.0 GB of memory, and the Producer PC was equipped with a Genuine Intel(R) Core(TM) Duo CPU at 1.83
GHz with 1.0 GB of memory. The monitoring programs for the proposed mechanisms were written in C/C++, using multithreading and socket programming. There are many resource parameters that can be used to measure resource status, such as CPU, physical memory, virtual memory, disk space, and network equipment data. To simplify the experiments, only two of these parameters, the CPU load percentage and the memory usage percentage, were used to evaluate the performance of our proposed mechanisms. The transition matrix T and the rate matrix R of our prediction models were derived from the first two weeks of the Google dataset for the two selected machines. The last two weeks of the dataset were used to evaluate the performance of our mechanisms. According to Figure 2 and Figure 3, which show the patterns of CPU and memory usage through the last two weeks of collected data, the memory usage of the two machines was more stable and lower than the CPU usage. This is because tasks were killed when their memory requests exceeded the limit, whereas this constraint does not apply to CPU usage, where tasks can use much more CPU than requested [17].

5.2 Evaluation Metrics
To evaluate the performance of our mechanisms, four metrics were used. The first metric was used to evaluate the accuracy of the measurements generated by each mechanism at the Consumer for a single resource (CPU or memory). This metric is the Standard Deviation (0 ≤ SD ≤ 1) between the monitored and actual values. A high value of SD indicates a high monitoring error and vice versa; when SD equals zero, the monitored data exactly matches the actual data and the monitoring accuracy is 100%. The second metric, used to evaluate the accuracy of the measurements generated by each mechanism for two resources (CPU and memory), is a new metric called avg_SD, whose value is the average of the SD values of CPU and memory.
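As an illustration, assuming SD is computed as the root-mean-square deviation between the N monitored values $m_t$ available at the Consumer and the corresponding actual values $a_t$ (both normalized to [0, 1]) over the evaluation period, the two metrics can be written as follows; the exact formula used is an assumption of this sketch:

```latex
\mathrm{SD} \;=\; \sqrt{\frac{1}{N}\sum_{t=1}^{N}\left(m_t - a_t\right)^2},
\qquad
\mathrm{avg\_SD} \;=\; \frac{\mathrm{SD}_{\mathrm{CPU}} + \mathrm{SD}_{\mathrm{Mem}}}{2}.
```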
Figure 2: Actual measurements of last two weeks of the first machine.
Figure 3: Actual measurements of last two weeks of the second machine.
The third metric was used to evaluate the communication overhead caused by each mechanism; it was measured as the total number of updates sent over the network. The fourth metric was used to evaluate the computation time taken by our mechanisms to predict one value.

5.3 Experimental Results
Three groups of experiments were conducted in our environment. The first group was used to evaluate the accuracy level and the communication overhead level of the proposed mechanisms. It was also used to determine the main "knobs and dials" behind these performance levels. Each experiment in this group was conducted 40 times with different random seed values for the random variables. The error bars represent 99% confidence intervals. This group was conducted on the data file of the first machine.
The second experiment group was conducted to examine how to calculate the knobs and dials identified in the first group. In other words, this group was used to build equations that enable us to select appropriate values of these knobs and dials in order to achieve the needed performance. These equations are empirical and valid only for the given dataset. Also, this group of experiments studied the effect of extending the monitored resources to include memory besides the CPU resource on the accuracy and network overhead of our monitoring mechanisms. The second group was conducted on the data file of the second machine. Foster et al. [28] have concluded that clouds and Grids share a lot of commonalities in their vision (e.g., architecture and technology), so most mechanisms for monitoring Grid resources have been used for monitoring cloud resources. Accordingly, GRIR and MTI, which are Grid monitoring mechanisms that try to minimize unnecessary and useless update messages while maximizing the accuracy of monitoring information, can be deployed to monitor cloud resources. The third group was used to compare the performance of the mechanism that exhibited better performance in the first group with the GRIR mechanism [8] and the MTI mechanism [6], which were deployed to monitor cloud resources. Some experiments of the third group were conducted on the data file of the first machine and the others on the data file of the second machine. In these three groups, the number of network updates and SD were used to represent the network overhead and the accuracy level, respectively.

1) First Group Results
According to the given dataset, the results of the first group of experiments showed that the CTMC-based and DTMC-based mechanisms performed well in monitoring resources. As
shown in Figure 4, the worst case of both mechanisms yielded an accuracy of about 80% (SD ≈ 20% at ETD ≥ 50%). Also, the SD of both mechanisms was almost always less than 40% of the ETD value. The accuracy of the CTMC-based mechanism exceeded that of the DTMC-based mechanism; in the worst case, the former produced accuracy and communication overhead better than the latter by about 5% and 44%, respectively. According to Figure 5, the communication overhead of both mechanisms decreased significantly, becoming almost nonexistent. At an ETD value of 60% for the CTMC-based mechanism and 90% for the DTMC-based one, the network consumption was almost zero, with accuracy values of 84% and 79%, respectively. This result does not imply that prediction alone is enough or can be enough in general. Although prediction was sufficient at this level of accuracy (no network messages were needed), there was still a need to send actual values over the network to achieve higher accuracy levels. Forcing these network messages to be sent can be achieved by lowering the ETD value. According to Figure 4 and Figure 5, the CTMC-based mechanism was more suitable than the DTMC-based mechanism for monitoring resources in large distributed environments, as the former achieved higher accuracy with less communication overhead at the same ETD value. The implementation results indicate that taking into account the time spent in each resource state before transitioning to another state explains the superiority of the CTMC-based mechanism over the DTMC-based mechanism in monitoring cloud resources; DTMC prediction is based only on transitions among states (transition matrix), whereas CTMC prediction also takes into account the time spent in each state before making a transition (rate matrix).
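To illustrate this difference, the following C++ fragment sketches one possible form of the two prediction steps: the DTMC variant consults only the transition matrix T, whereas the CTMC variant also uses the exit rates from the rate matrix R to check whether the expected holding time 1/λi of the current state has elapsed before predicting a jump. The function names and the exact decision rule are illustrative assumptions, not the code used in our experiments.

```cpp
// Illustrative contrast between DTMC and CTMC prediction (not the actual code).
// T is the K x K transition-probability matrix; rate[i] is the exit rate
// lambda_i of state i, so 1/rate[i] is the expected holding time in time units.
#include <cstddef>
#include <vector>

std::size_t mostLikelyNext(std::size_t i, const std::vector<std::vector<double>>& T) {
    std::size_t best = 0;
    for (std::size_t j = 1; j < T[i].size(); ++j)
        if (T[i][j] > T[i][best]) best = j;
    return best;
}

// DTMC: predict a transition at every monitoring time unit.
std::size_t predictDTMC(std::size_t i, const std::vector<std::vector<double>>& T) {
    return mostLikelyNext(i, T);
}

// CTMC: stay in the current state while the expected holding time has not
// elapsed, and only then predict the most likely jump target.
std::size_t predictCTMC(std::size_t i, double elapsedInState,
                        const std::vector<std::vector<double>>& T,
                        const std::vector<double>& rate) {
    double expectedHolding = 1.0 / rate[i];          // E(T_i) = 1 / lambda_i
    if (elapsedInState < expectedHolding) return i;  // no transition expected yet
    return mostLikelyNext(i, T);
}
```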
Figure 4: The accuracy of the CTMC-based and DTMC-based mechanisms at different ETD values. Error bars are 99% confidence intervals.
According to Figure 6 and Figure 7, the communication overhead of both our mechanisms decreased and the accuracy slightly increased as the number of states grew, up to 80 states. For example, at ETD = 20% and K = 10 states, the SD of the CTMC-based mechanism was 8% with about 10% communication overhead (423 updates out of a maximum of 4200 actual readings). This overhead and SD improved to less than 6% (243 updates / 4200) and less than 7.7%, respectively, when the number of states was increased to 80. This effect was particularly noticeable for the CTMC-based mechanism. Although the computation time was not more than 100 microseconds, it was affected by the number of states (see Figure 8). The computation time increased with the number of states due to the larger transition and rate matrices and, in turn, the larger number of operations needed to predict the next state. Increasing the number of states increases the number of centroids and quantization levels. Therefore, the quantization error (the difference between the centroid and the actual value) decreased as the number of states increased, which in turn decreased the number of updates sent through the network and increased the accuracy. This explains the decrease of the network overhead and the increase of the accuracy with the number of states, as shown in Figure 6 and Figure 7.
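For intuition only, assume (hypothetically) that the K quantization levels were evenly spaced over the normalized range [0, 1]; K-means centroids are not evenly spaced, but the same trend holds:

```latex
\text{maximum quantization error} \;=\; \frac{1}{2K},
\qquad K = 10 \;\Rightarrow\; 0.05,
\qquad K = 80 \;\Rightarrow\; \approx 0.006 .
```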
Figure 5: The effect of the CTMC-based and DTMC-based mechanisms on the communication overhead at different ETD values. Error bars are 99% confidence intervals.
Updates (corrections) are pushed from the monitored resources when the predicted values deviate from the correct path by a value that exceeds the ETD. This deviation would decrease the accuracy of our proposed mechanisms, but the corrections adjusted this accuracy degradation when it occurred. This explains the slight increase of our proposed mechanisms' accuracy shown in Figure 6. Therefore, the main controller of the accuracy is the ETD value. According to Figures 4, 5, 6, and 7, the CTMC-based mechanism outperformed the DTMC-based mechanism by a considerable amount in both accuracy and network overhead. So, the experiments of the second and third groups were conducted only on the CTMC-based mechanism. The results of the first group of experiments showed that there were two key controllers affecting the performance (accuracy, communication overhead, and computation time) of both mechanisms: the first was the ETD value and the second was the number of states (K). Hence, the desired performance is achieved by selecting proper ETD and K values. The following subsection discusses how to calculate these proper values.
Figure 6: The effect of number of states on the accuracy of CTMC-based and DTMC-based mechanisms. ETD = 20 and error bars are 99% confidence intervals.
Figure 7: The effect of number of states on the communication overhead of CTMC-based and DTMC-based mechanisms. ETD = 20 and error bars are 99% confidence intervals.
2) Second Group Results
The second group of experiments showed that the network overhead decreased exponentially with increasing ETD value. According to Figure 9, the number of updates pushed to the Consumer when monitoring only one resource (CPU) is described by an empirical exponential fit of the ETD value, referred to below as Equation 1.
Figure 8: The effect of number of states on the time elapsed by both proposed mechanisms to predict one value where error bars are 99% confidence intervals.
Figure 9: Relation between ETD and the number of updates to Consumer caused by monitoring only one resource (CPU) of the second machine. K = 40 as a balance value.
According to Figure 10, the number of updates pushed to the Consumer when monitoring two resources (memory and CPU) is described by a second empirical fit, referred to below as Equation 2.
By comparing Equation 1 and Equation 2, it was found that extending the monitored resources to include the memory resource besides the CPU resource increases the network overhead.
(The experiments in this study were conducted on the last two weeks (1,252,000 seconds) of the Google dataset, which had one measurement every 300 seconds; hence, there were about 4173 measurements.)
Each monitored resource has its own prediction model. In other words, if the memory resource is stable at time t, the CPU resource may be unstable (pushing updates to the Consumer) at that time (see Figure 2 and Figure 3). So, when both resources are unstable at the same time, more updates will be pushed to the Consumer.
This explains why the network overhead increases as the number of monitored resources increases.
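As a brief illustration, the following C++ fragment mirrors this behavior under our assumptions: each resource keeps its own predicted value, and a correction has to be pushed whenever any resource deviates beyond the ETD. The type and function names are illustrative and not part of the implementation described above.

```cpp
// Illustrative per-resource push check (names and structure are assumptions).
#include <cmath>
#include <string>
#include <unordered_map>

struct ResourceSample { double predicted; double measured; };

// Returns true when at least one resource deviates beyond the ETD, i.e.,
// when a correction message has to be pushed to the Consumer.
bool needsPush(const std::unordered_map<std::string, ResourceSample>& samples,
               double etd) {
    for (const auto& entry : samples)
        if (std::fabs(entry.second.measured - entry.second.predicted) > etd)
            return true;   // e.g., CPU unstable even if memory is stable
    return false;
}
```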
Figure 10: Relation between ETD and number of updates to Consumer caused by monitoring two resources (CPU and Memory). K = 40 as a balance value.
According to Figure 11 and Figure 12, the accuracy degraded linearly with increasing ETD value. According to Figure 11, the SD caused by monitoring only the CPU resource relates to the ETD value through the linear fit

$SD = a_1 \cdot ETD + b_1$,    (Equation 3)

where $a_1 = 39 \times 10^{-4}$ and $b_1 = 3875 \times 10^{-4}$. According to Figure 12, the avg_SD caused by monitoring two resources (CPU and memory) relates to the ETD value through

$avg\_SD = a_2 \cdot ETD + b_2$,    (Equation 4)

where $a_2 = 2 \times 10^{-4}$ and $b_2 = 3863 \times 10^{-4}$. By comparing Equations 3 and 4, it was found that extending the monitored resources to include the memory resource besides the CPU resource decreased the avg_SD by only $(a_1 - a_2) \cdot ETD + (b_1 - b_2)$. More monitored resources mean more corrections to the Consumer. Also, when one resource is unstable, the state of all resources (including stable resources) will be pushed to the Consumer, resulting in more real points on the generated graph. This explains the increase of both accuracy and network overhead when the number of monitored resources increases.
Figure 11: Relation between ETD and SD caused by monitoring only one resource (CPU). K = 40 as a balance value.
Figure 12: Relation between ETD and SD caused by monitoring two resources (CPU and Memory). K = 40 as a balance value.
According to the linear regression in Figure 13 and the given dataset, as the number of states (K) increases, the number of updates caused by monitoring one resource and two resources decreases, on average, by 3.37 and 4.73 updates per added state, respectively. Therefore, extending the monitored resources to include memory besides CPU increases the network overhead by 40%. The equations of this group of experiments enable us to control the performance of our proposed CTMC-based mechanism based on the desired network overhead and accuracy level.

3) Third Group Results
The results of the third group of experiments showed that the CTMC-based mechanism outperformed the GRIR and MTI mechanisms. As shown in Figure 14, the CTMC-based mechanism achieved the same accuracy level as the GRIR and MTI mechanisms with a communication overhead (caused by monitoring the CPU only) that ranged from 7% to 28% and from 30% to 67% of the network overhead caused by the GRIR mechanism and the MTI mechanism, respectively. For example, the GRIR, MTI, and CTMC-based mechanisms achieved at least a 90% accuracy level with a network overhead of 556, 264, and 121 updates, respectively.
Figure 13: Effect of increasing the number of states K and the number of monitored resources on the network overhead.

According to Figure 15, the CTMC-based mechanism outperformed the GRIR mechanism, which does not employ prediction in monitoring resources, and the MTI mechanism, which employs prediction. The MTI mechanism was implemented at K = 3 and at different job processing times (10–360 s). As shown in Figure 15, the CTMC-based mechanism achieved the same accuracy level as the GRIR and the MTI mechanisms but with much less communication overhead (caused by monitoring the two resources, CPU and Memory, of the second machine), ranging from 0.85% to 1.3% and from 2.6% to 4.9% of the network overhead of the GRIR and the MTI, respectively. As shown in Table 1, the CTMC-based mechanism had the highest computation time (29.1 μs). However, this time is still small compared to the time needed to send a single update message over the network (634 μs). This relative superiority of our proposed mechanism came at the cost of computation time, but this cost is acceptable.

Figure 14: The relation between the accuracy and the communication overhead produced by monitoring the CPU resource of the first machine, for our CTMC-based, MTI, and GRIR mechanisms. K = 40 for CTMC.
According to the results in Figures 14 and 15, prediction-based monitoring mechanisms, such as our proposed and MTI mechanisms, outperformed mechanisms that do not employ prediction, like GRIR, in both cases of monitoring one and two resources. According to Figure 15, the accuracy of MTI, on average, ranged from 83% (SD = 17%) to 86% (SD = 14%), while the CTMC-based mechanism achieved the same level of network overhead with accuracy that ranged from 91% to 94% (this is not shown in the figure, but can be deduced from Figures 4 and 5). The main reasons for the superiority of the CTMC-based mechanism are: (1) it uses more states than the MTI mechanism, which uses only three states; (2) MTI uses fixed values for these states, in contrast to the state values in our mechanism, which were obtained by data analysis using the K-means clustering algorithm; and (3) our mechanism is based on CTMC prediction, whose superiority over the DTMC prediction used by the MTI mechanism was shown above. Our proposed CTMC-based mechanism can be considered adaptive and intelligent. It has
employed most of the properties mentioned in [4] that make it suitable for monitoring cloud resources. The Producer pushes updates to the Consumer when the difference between the actual and predicted monitored values exceeds an error tolerance degree, to adjust the prediction path at the Consumer. This shows the adaptability property (reacting quickly to load changes) and the autonomicity property (reacting to changes and performance degradation automatically) of the CTMC-based mechanism. Moreover, the monitoring data of the Producer are available at the Consumer at any time. This shows the timeliness and availability properties of the CTMC-based mechanism. Also, the monitoring data at the Consumer are accurate enough to be used for making decisions.
Figure 15: The relation between the accuracy and communication overhead, which was produced from monitoring the CPU and Memory resources of the second machine, of CTMC-based, GRIR, and MTI mechanisms, where K = 40 for CTMC.
Table 1: Comparison among the three mechanisms in terms of computation time.

Monitoring Mechanism    Computation Time (μs)
CTMC-based              29.083
MTI                      1.517
GRIR                     0.049
Finally, we note that although we tried our best to use a realistic dataset captured from a real-world cloud setting, we cannot claim that our results are general; they are indeed tied to the
used dataset. However, we believe that the insights gained from the results can be generalized, and we leave the generalization effort as a subject of future work.

6. CONCLUSION AND FUTURE WORK
We have developed two monitoring mechanisms. The first mechanism is based on a CTMC model and the second on a DTMC model. The results showed that both mechanisms were accurate enough to be used for monitoring cloud resources with a low communication overhead. The main reason behind these good results was the corrections pushed from the Producers. These corrections guaranteed the right prediction path at the Consumer and the stable accuracy of both mechanisms with different numbers of states. Also, the results showed that the CTMC-based mechanism outperformed the DTMC-based mechanism by at least 5% in accuracy and by 17% to 44% in communication overhead. The network overhead of both our mechanisms decreased, and their accuracy slightly increased, with an increasing number of states of the Markov model. This effect was particularly noticeable for the CTMC-based mechanism. Our CTMC-based mechanism outperformed the GRIR monitoring mechanism, which does not employ prediction, and the MTI mechanism, which employs DTMC prediction. Our proposed adaptive, intelligent CTMC-based mechanism has employed most of the properties that make it suitable for monitoring cloud resources: accuracy, adaptability, autonomicity, timeliness, non-intrusiveness (there is no polling in our proposed mechanisms), and availability [4]. Based on the implementation results of our proposed monitoring mechanisms and the importance of the monitoring process in recovering from many problems in cloud computing environments, there are several ideas
that could be addressed in future work: (1) the accuracy and performance of our proposed CTMC-based mechanism need to be studied on larger datasets (e.g., collected over one year); (2) the effect of building Markov models for different periods of the day on the performance of the CTMC-based mechanism needs to be studied; (3) the effect of different clustering algorithms on the performance of the CTMC-based mechanism should be studied; (4) the accuracy, communication overhead, and computation time of our mechanism depend on the ETD and K values, so determining their optimal values is needed; and (5) finally, our monitoring mechanism needs to be evaluated in a large-scale cloud computing environment.

References
[1] R. Buyya, Cloud Computing: Principles and Paradigms, R. Buyya, J. Broberg, and A. Goscinski, Eds.: John Wiley & Sons, Inc., 2011.
[2] Fang-fang Han et al., "Virtual Resource Monitoring in Cloud Computing," Journal of Shanghai University, vol. 15, no. 5, pp. 381-385, 2011.
[3] B. S. Ghio, "Project of a SDP prototype for Public Administrations and private networks," Faculty of Mathematics, Physics and Natural Sciences, University of Genoa, Master of Science in Information Technology, 2012.
[4] G. Aceto, A. Botta, W. de Donato, and A. Pescapè, "Cloud monitoring: A survey," Computer Networks, vol. 57, no. 9, pp. 2093–2115, 2013.
[5] J. Brandt, A. Gentile, J. Mayo, and P. Pebay, "Resource Monitoring and Management with OVIS to Enable HPC in Cloud Computing Environments," in 23rd IEEE International Parallel & Distributed Processing Symposium (5th Workshop on System Management Techniques, Processes, and Services), Rome, Italy, 2009, pp. 1-8.
[6] J. Park, K. Chung, E. Lee, Y. Jeong, and H. Yu, "Monitoring Service Using Markov Chain Model in Mobile Grid Environment," in 5th International Conference on Advances in Grid and Pervasive Computing (GPC 2010), Hualien, Taiwan, 2010, pp. 193-203.
[7] John Wilkes and Charles Reiss. (2011) Details of the ClusterData-2011-1 trace. [Online]. https://code.google.com/p/googleclusterdata/wiki/ClusterData2011_1
[8] W. Chung and R. Chang, "A New Mechanism for
Resource Monitoring in Grid Computing," Future Generation Computer Systems (FGCS), vol. 25, pp. 1-7, 2009.
[9] George F. Luger, Artificial Intelligence: Structures and Strategies for Complex Problem Solving, 6th ed.: Addison-Wesley, 2008.
[10] Ilya Molchanov, Stochastic Processes I: Discrete Time.: University of Bern, 2011.
[11] M. K. Haider, A. K. Ismail, and I. A. Qazi, "Markovian Models for Electrical Load Prediction in Smart Buildings," in the 19th International Conference on Neural Information Processing (ICONIP), Doha, Qatar, 2012, pp. 632–639.
[12] Charles M. Grinstead and J. Laurie Snell, Introduction to Probability, 2nd ed.: American Mathematical Society, 1997.
[13] O. Ardakanian, S. Keshav, and C. Rosenberg, "Markovian Models for Home Electricity Consumption," in GreenNets '11: Proceedings of the 2nd ACM SIGCOMM Workshop on Green Networking, New York, 2011, pp. 31-36.
[14] David F. Anderson, Introduction to Stochastic Processes with Applications in the Biosciences. Madison: University of Wisconsin, 2011.
[15] H. Huang and L. Wang, "P&P: A Combined Push-Pull Model for Resource Monitoring in Cloud Computing Environment," in 3rd IEEE International Conference on Cloud Computing, Miami, 2010, pp. 260–267.
[16] E. Galstad. (2011) NSCA - Nagios Service Check Acceptor. [Online]. http://nagios.sourceforge.net/download/contrib/documentation/misc/NSCA_Setup.pdf
[17] Shirlei Aparecida de Chaves, Rafael Brundo Uriarte, and Carlos Becker Westphall, "Toward an Architecture for Monitoring Private Clouds," IEEE Communications Magazine, vol. 49, no. 12, pp. 130–137, 2011.
[18] S. Clayman et al., "Monitoring service clouds in the future internet," in Towards the Future Internet, G. Tselentis et al., Eds.: IOS Press, 2010, pp. 115-126.
[19] E. Galstad. (2007) Nagios NRPE Documentation. [Online]. http://nagios.sourceforge.net/docs/nrpe/NRPE.pdf
[20] Blair Hester and Marie McGarry. (2012) VMware. [Online]. http://hyperic-hq.sourceforge.net/
[21] The-Ganglia-Project. (2012) Ganglia Monitoring System. [Online]. http://www.ganglia.info
[22] I. Legrand et al., "MonALISA: An agent based, dynamic service system to monitor, control and optimize distributed systems," Computer Physics Communications, vol. 180, no. 12, pp. 2472–2498, 2009.
[23] M. M. Al-Sayed, S. Khattab, and F. Omara, "Resource Monitoring Algorithms Evaluation for Cloud
Environment," International Journal of Computer Science & Security (IJCSS), vol. 7, no. 5, pp. 159-174, 2013.
[24] The-VLFeat-Team. (2014) VLFeat. [Online]. http://www.vlfeat.org/api/kmeans-fundamentals.html
[25] Charles Reiss, Alexey Tumanov, Gregory R. Ganger, Randy H. Katz, and Michael A. Kozuch, "Towards understanding heterogeneous clouds at scale: Google trace analysis," Intel Science & Technology Center for Cloud Computing, 2012.
[26] Charles Reiss and John Wilkes, Google cluster-usage traces: format + schema, 2nd ed.: Google Inc., 2011.
[27] (2014, December) OpenNebula. [Online]. http://opennebula.org/about
[28] I. Foster, Y. Zhao, I. Raicu, and S. Lu, "Cloud computing and Grid computing 360-degree compared," in Grid Computing Environments Workshop, Austin, TX, 2008, pp. 1-10.
[29] John B. Fraleigh and Raymond A. Beauregard, Linear Algebra, 3rd ed. United Kingdom: Addison-Wesley, 1994.
Author Biography
Mustafa Al-Sayed is a teaching assistant in the Faculty of Computers and Information, Department of Computer Science, at Minia University, Egypt. He is currently working on monitoring resources in cloud environments. His research interests include distributed computing, cloud service discovery, and data mining.