
Self-Adaptive Cloud Monitoring with Online Anomaly Detection

Tao Wang1, Jiwei Xu1, Wenbo Zhang1*, Zeyu Gu2, and Hua Zhong1
1. Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
2. SRE (Site Reliability Engineering) Group, Baidu Inc., Beijing 100085, China
*Corresponding author: [email protected]

Abstract—Monitoring is the key to guaranteeing the reliability of cloud computing systems. By analyzing monitoring data, administrators can understand systems' statuses to detect, diagnose and solve problems. However, due to the enormous scale and complex structure of cloud computing, a monitoring system has to collect, transfer, store and process a large amount of monitoring data, which brings a significant performance overhead and increases the difficulty of extracting useful information. To address the above issue, this paper proposes a self-adaptive monitoring approach for cloud computing systems. First, we conduct correlation analysis between different metrics, and monitor only selected important ones which represent the others and reflect the running status of a system. Second, we characterize the running status with Principal Component Analysis (PCA), estimate the anomaly degree, and predict the possibility of faults. Finally, we dynamically adjust the monitoring period based on the estimated anomaly degree and a reliability model. To evaluate our proposal, we have applied the approach to our open-source TPC-W benchmark Bench4Q deployed on our real cloud computing platform OnceCloud. The experimental results demonstrate that our approach can adapt to dynamic workloads, accurately estimate the anomaly degree, and automatically adjust monitoring periods. Thus, the approach can effectively improve the accuracy and timeliness of anomaly detection in an abnormal status, and efficiently lower the monitoring overhead in a normal status.

Keywords—Cloud computing; Anomaly detection; Adaptive monitoring; Correlation analysis

1 INTRODUCTION
In recent years, we have witnessed the rapid development of cloud computing. Being applied in various fields, cloud computing has become a hot spot in the IT industry. Many large IT companies all over the world have released cloud platforms as their core businesses (e.g., Amazon EC2 and Microsoft Azure), while open-source cloud platforms (e.g., Eucalyptus and OpenStack) also contribute to cloud computing. Currently, online services on the Internet (e.g., e-commerce, online finance, social networks) have become important parts of our lives, and more and more services have been deployed on cloud platforms (e.g., Salesforce CRM and Netflix). However, due to the diversity of deployed services and the dynamism of the deployment environment, cloud computing systems often suffer from unexpected faults which result in significant disruption of user experiences and enormous economic losses for companies [1]. For example, the Amazon website was down for about 45 minutes in August 2013, which caused about $5.3 million in losses as customers could not make purchases during that period. Furthermore, the delivery models (e.g., IaaS, PaaS, SaaS) are vulnerable to security attacks [2].

Efficient monitoring is the prerequisite for timely detecting anomalies and accurately locating root causes [3]. Cloud computing systems are always large in scale and have complex architectures. To track the running status of these systems, a monitoring system acquires various kinds of monitoring data from different layers (e.g., network, hardware, virtual machine, operating system, middleware, application) on many distributed nodes. However, collecting, storing and processing a large amount of monitoring data causes a huge resource overhead, which affects the timeliness of anomaly detection, the accuracy of fault location, and even the overall performance. Furthermore, it is difficult to mine valuable information from large-scale data. Thus, many commercial monitoring tools (e.g., Amazon CloudWatch) only support a fixed and relatively long monitoring period (e.g., collecting a data instance every minute) and manually pre-defined collected metrics. Meanwhile, the providers of cloud monitoring services charge their customers according to the monitored metrics and the monitoring frequency. The monitoring cost is usually proportional to the number of monitored metrics and the monitoring frequency, and on average takes up 18% of the total running cost of cloud computing systems [4]. Thus, on the one hand, administrators and customers want to decrease the number of monitored metrics and the monitoring frequency to reduce their costs; on the other hand, fewer monitored metrics and a lower monitoring frequency result in less monitoring data, making it hard to locate faults efficiently. For example, the lower the monitoring frequency is, the higher the chance that faults occur just in between two monitoring data instances.

Hence, how to select monitored metrics and adjust the monitoring frequency has become a key issue in ensuring the efficiency of monitoring systems.

Currently, monitoring systems for large-scale data centers mainly focus on optimizing system architectures and transport protocols to reduce the network bandwidth pressure caused by transporting monitoring data. Administrators usually set the monitored metrics and the monitoring frequency manually according to their own domain knowledge. This approach is limited by the knowledge of administrators. Moreover, the various kinds of monitored applications in cloud computing are unknown to platform administrators, so it is difficult for administrators to manually define suitable monitoring rules for specific situations.

To address the above issues, this paper proposes a self-adaptive monitoring approach for cloud computing systems. First, we conduct correlation analysis between different metrics, and monitor only selected important ones which can represent the others and reflect the running status of the whole system. Second, we use Principal Component Analysis (PCA) to characterize the running status and estimate the anomaly degree to predict the possibility of faults. Finally, we dynamically adjust monitoring periods based on the estimated anomaly degree and a reliability model. To evaluate our proposal, we have applied our approach to our open-source TPC-W benchmark, Bench4Q, deployed on our real cloud computing platform, OnceCloud. The experimental results demonstrate that our approach can adapt to dynamic workloads, accurately estimate the anomaly degree and automatically adjust the monitoring frequency. Thus, our approach can effectively improve the accuracy and timeliness of anomaly detection in an abnormal status, and efficiently lower the monitoring overhead in a normal status. The main contributions of this paper are as follows:
- We propose a correlation-based method to select key metrics that represent other metrics. As a result, we collect less monitoring data to understand a system's status, and thus reduce the monitoring overhead.
- We model the correlations between metrics with PCA to characterize a system's running status. Therefore, by detecting deviations with the cosine similarity, our method can automatically detect anomalies without domain knowledge.
- We dynamically adjust monitored metrics and monitoring periods based on a reliability model. In this way, our method can self-adaptively collect monitoring data to lower the monitoring overhead when the possibility of faults is low, and ensure timeliness when faults are more likely to happen.
- We design a self-adaptive monitoring framework, which can efficiently collect monitoring data from various distributed systems deployed in a cloud computing environment.
- We conduct extensive experiments to evaluate our framework. The experimental results demonstrate that the framework can effectively estimate the anomaly significance of typical faults, and efficiently collect monitoring data with a low overhead.

The remainder of this paper is organized as follows. Section 2 introduces our monitoring framework; Section 3 presents our algorithms for dynamic monitoring; Section 4 evaluates the effectiveness of our approach with a series of experiments; Section 5 analyzes and compares related work; and Section 6 summarizes this paper.

2 SELF-ADAPTIVE MONITORING FRAMEWORK
Monitored metrics and the monitoring frequency are two key factors which influence the monitoring overhead and the accuracy and timeliness of anomaly detection. Therefore, we first select important metrics that can represent systems' running statuses by analyzing the correlations between different metrics with historical data, and then automatically adjust the monitoring period according to the estimated anomaly degree.

Since there are a lot of resources from different layers to be monitored in cloud computing systems, reducing the number of monitored metrics becomes crucial to improving monitoring efficiency. Thus, we select the most representative metrics reflecting the possibility of system faults, while neglecting the less representative ones to reduce the monitoring overhead. For the adjustment of monitoring periods, when the possibility of faults increases, we reduce the time interval between two adjacent monitoring operations. Thus, we can collect and analyze more monitoring data to keep track of a system's running status by increasing the monitoring frequency. On the contrary, when faults are less likely to happen, we increase the monitoring period to lower the monitoring overhead. Since faults are relatively rare events during the whole operation of cloud computing systems, the dynamic adjustment of monitoring periods can dramatically reduce the monitoring overhead, which is important for efficient and extensible monitoring in cloud computing.

MAPE-K (Monitoring, Analysis, Planning, Execution, Knowledge) [5] is an autonomic computing model consisting of several connected autonomic elements, which accomplish the automatic management of computer systems by interacting and cooperating with each other.

Based on the MAPE-K model, we propose a self-adaptive monitoring framework for cloud computing systems. This framework can automatically estimate the anomaly degree, and then adjust monitored metrics and monitoring periods to improve the accuracy and timeliness of anomaly detection in an abnormal status, and lower the monitoring overhead in a normal status. As shown in Fig. 1, the framework contains the following five autonomic elements.

[Figure: the five autonomic elements — Monitor (M), Analyzer (A), Planner (P), Executor (E) and Knowledge Base (K) — with components such as the Data Collector, Historical and Real-Time Data Storage, System Status Characterizer, Metric Correlation Graph, Anomaly Significance Estimator, Metric Significance Estimator, Fault Predictor, Period Adjustor, Metric Selector and Metric Adjustor, managing agents on hosts, virtual machines and containers.]
Fig. 1 Self-adaptive Monitoring Framework for Cloud Computing Systems

- Monitor: deploys monitoring agents on physical machines, virtual machines or containers to collect and persistently store monitoring data from different layers.
- Analyzer: calculates the correlations between different metrics to evaluate their importance, and then uses PCA to compute the eigenvector of the monitoring data and quantify the anomaly degree, according to the historical monitoring data obtained from the Monitor.
- Planner: selects monitored metrics for the next phase based on the evaluated importance of these metrics. It uses a Poisson process to build a reliability model, predicts the possibility of faults based on the anomaly degree, and then adjusts the monitoring period for the next operation.
- Executor: dynamically adjusts monitored metrics and monitoring periods through the agents.
- Knowledge Base: records the characterized workloads and the corresponding eigenvectors from the operation process to describe the normal running status.
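To make the cooperation of these five elements concrete, the sketch below outlines one iteration of the resulting control loop under the MAPE-K model. It is only an illustrative sketch: the component interfaces (collect, estimate_anomaly, plan_period, apply, update) are hypothetical names, not the framework's actual API.

```python
# Minimal sketch of one MAPE-K iteration for self-adaptive monitoring.
# All component interfaces below are hypothetical illustrations.
import time

def control_loop(monitor, analyzer, planner, executor, knowledge):
    period = planner.max_period          # start with the longest (cheapest) period
    metrics = knowledge.key_metrics      # key metrics selected offline (Algorithm 1)
    while True:
        sample = monitor.collect(metrics)                        # M: gather one data instance
        anomaly = analyzer.estimate_anomaly(sample,              # A: PCA-based anomaly degree
                                            knowledge.baseline_direction)
        period = planner.plan_period(anomaly)                    # P: reliability-model adjustment
        metrics = planner.plan_metrics(anomaly, knowledge)       # P: metrics for the next phase
        executor.apply(period, metrics)                          # E: reconfigure the agents
        knowledge.update(sample, anomaly)                        # K: record status for later runs
        time.sleep(period)
```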

3 ALGORITHMS OF DYNAMIC MONITORING WITH ANOMALY ESTIMATION
3.1 Selection of key metrics
The metrics of physical resources in cloud computing systems always correlate with each other [6]. Thus, we use Pearson's correlation coefficient to measure the correlation degree of every two metrics. The range of the coefficient is from -1 to 1, where 1 indicates a perfect positive correlation, -1 indicates a perfect negative correlation, and 0 indicates no correlation between two metrics. We use some metrics to represent other correlated metrics. In this way, we monitor only a few metrics and infer the changes of the rest, which decreases the number of monitored metrics and reduces the monitoring overhead. As shown in Algorithm 1 in Table 1, we describe the algorithm in detail as follows:
(1) Calculate the correlation coefficients between any two metrics. As shown in Table 2, metrics x and y are strongly correlated when

$r_{xy} \in [-1.0, -0.8] \cup [0.8, 1.0]$.   (1)

(2) Use an n-node undirected graph to describe the correlations between metrics. Each node represents a metric in the graph, and an undirected edge connects two metrics that are strongly correlated with each other. Thus, we can use an n×n matrix $A_{n\times n}$ to represent the metric correlation graph. The weight of each node equals the number of connected nodes (i.e., the number of connected edges).
(3) Select the node with the maximum weight, delete the nodes and edges connected with that node, and update the graph and the weight of every remaining node.
(4) Repeat the above operations until there is no node left in the graph.

TABLE 1 KEY METRICS SELECTION ALGORITHM
ALGORITHM 1. KEY METRICS SELECTION
INPUT: monitoring data set $DS = \{p_1, p_2, \ldots, p_i, \ldots, p_n\}$, where $p_i = (x, y, \ldots)$ and $x, y, \ldots$ represent metrics.
OUTPUT: key metric set MS.
1. Compute the correlation coefficient between any two metrics x, y:
$$r_{xy}=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^{2}}\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^{2}}},\quad \text{where } \bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_i,\ \bar{y}=\frac{1}{n}\sum_{i=1}^{n}y_i.$$
2. Construct the metric correlation graph $A_{n\times n}$:
$$A[i][j]=\begin{cases}1, & \text{if } r_{ij}^{2}\in[0.64,1.0]\\ 0, & \text{if } r_{ij}^{2}\notin[0.64,1.0],\end{cases}$$
where $1 \le i, j \le n$, i and j are nodes representing metrics i and j, n is the number of metrics, and A[i][j] = 1 denotes an edge between i and j.
3. Calculate the weight of metric i: $w[i]=\sum_{j=1}^{n}A[i][j]$.
4. WHILE (TRUE) {
      Select the maximum value in array w: w[a];
      IF (w[a] = -1) THEN RETURN MS;
      w[a] = -1;  MS = MS ∪ {a};
      FOR (j = 1; j ≤ n; j++)
         IF (A[a][j] = 1) THEN { w[j] = -1; delete node j and its edges from the graph; }
      Update the weights of the remaining nodes;
   }
Note that the threshold for correlation coefficients directly influences the monitoring overhead and the accuracy of anomaly detection. If the threshold is too high, many metrics would be classified as "not correlated", which means that these metrics cannot be represented by other metrics, and thus many unnecessary metrics would be included in the monitoring list and the monitoring overhead would increase. On the contrary, if the threshold is too low, some metrics only weakly correlated with each other would still be represented by others. Although such represented metrics may contain clues of faults, they are likely to be ignored because they are incorrectly represented by others. Therefore, some anomalies and symptoms would be neglected, and problems could not be located accurately due to the lack of sufficient monitoring data. To ensure the effectiveness of anomaly detection and the accuracy of problem location, this paper sacrifices a part of the monitoring overhead. According to [7], the threshold is set to 0.8, that is, we only consider strongly correlated metrics.

TABLE 2 CORRELATION DESCRIPTION
Degree of correlation | Correlation coefficient (r) | r²
Not correlated        | [0, 0.2)                    | [0, 0.04)
Normal                | [0.2, 0.4)                  | [0.04, 0.16)
Weakly correlated     | [0.4, 0.6)                  | [0.16, 0.36)
Correlated            | [0.6, 0.8)                  | [0.36, 0.64)
Strongly correlated   | [0.8, 1.0]                  | [0.64, 1]
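As a concrete illustration of Algorithm 1, the following sketch builds the correlation graph from a matrix of historical samples and greedily keeps the highest-weight node until the graph is empty. It is a minimal reading of the algorithm under the 0.8 threshold from Table 2, using NumPy's Pearson correlation; the function and variable names are ours, not the authors' implementation.

```python
import numpy as np

def select_key_metrics(data, threshold=0.8):
    """data: n x m array (n samples, m metrics). Returns the indices of key metrics."""
    m = data.shape[1]
    r = np.corrcoef(data, rowvar=False)          # m x m Pearson correlation matrix
    adj = (np.abs(r) >= threshold).astype(int)   # edge iff |r_xy| >= 0.8 (i.e., r^2 >= 0.64)
    np.fill_diagonal(adj, 0)                     # no self-loops
    weight = adj.sum(axis=1)                     # node weight = number of neighbors
    removed = np.zeros(m, dtype=bool)
    key = []
    while not removed.all():
        a = int(np.argmax(np.where(removed, -1, weight)))  # node with the maximum weight
        key.append(a)                                      # keep it as a key metric
        neighbors = np.where(adj[a] == 1)[0]
        drop = np.append(neighbors, a)                     # delete the node and its neighbors
        removed[drop] = True
        adj[drop, :] = 0
        adj[:, drop] = 0
        weight = adj.sum(axis=1)                           # update the remaining weights
    return key
```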

3.2 Estimation of anomaly degree

TABLE 3 ANOMALY SIGNIFICANCE ESTIMATION ALGORITHM
ALGORITHM 2. ANOMALY SIGNIFICANCE ESTIMATION
INPUT: NULL
OUTPUT: anomaly degree at time t: AnoSig_t.
1. Collect monitoring data instances: the current monitoring data instance is p_t = (x, y, ...), where x, y, ... are metrics; the current workload vector is wv_t.
2. Characterize the system's running status:
   (1) Running state set: ss = {s_1, s_2, ...}, where s_i = (wp_i, ms_i, fd_i); wp_i is the ith load pattern, ms_i is the ith key metric set, and fd_i is the data direction corresponding to wp_i;
   (2) Load pattern set: wps = {wp_1, wp_2, ...}, where wp_i is the ith load pattern;
   (3) Sliding window: SW = {(p_1, wv_1), (p_2, wv_2), ..., (p_i, wv_i), ..., (p_n, wv_n)}, where p_i is the ith monitoring data point, wv_i is the ith load vector, and n is the size of the sliding window.
3. Update the sliding window: use the current monitoring data instance p_t and the current load vector wv_t, i.e., (p_t, wv_t), to update SW; recognize the current load pattern: wp_t = PR(wv_t); calculate the eigenvector of the data set in SW: fd_t.
4. Estimate the anomaly degree:
   IF (wp_t ∈ wps)  // the current load pattern matches a known load pattern
      Calculate the anomaly degree: AnoSig_t = CalAnomaly(fd_t, fd_i);
   ELSE
      According to Algorithm 1, calculate the key metric set under the current load pattern: ms_t; wps = wp_t ∪ wps; s_t = (wp_t, ms_t, fd_t); ss = s_t ∪ ss.

We use PCA to calculate the eigenvector of the monitoring data to represent systems' running statuses, and then use the deviation between the eigenvector of historical data and that of current data to estimate the anomaly degree. When the degree is high, we shorten the time interval between every two adjacent monitoring operations to keep track of the system's running status, and thus improve the accuracy and timeliness of anomaly detection. Conversely, when the anomaly degree is low, we extend the monitoring interval to lower the overhead. Since faults are relatively rare events during the operation of cloud computing systems, the dynamic adjustment of monitoring periods can dramatically reduce the monitoring overhead.

PCA is a multivariate statistical method which transforms m correlated metrics into k (k < m) uncorrelated principal components. The eigenvector associated with the principal component gives the main direction of the data set in the sliding window, which reflects the correlations between metrics under the normal status. If the latest monitoring data contain information about some faults, the eigenvector direction should change after absorbing the abnormal data instances. Then we can use the direction deviation to calculate the anomaly degree. As shown in Algorithm 2 in Table 3, we describe the algorithm in detail as follows:

(1) Collect monitoring data instances. Construct a sliding window with a size of n; collect a monitoring data instance X = (x_1, x_2, ..., x_m), where m is the number of metrics for each collection and x_i is the value of the ith metric; store monitoring data instances in the sliding window in chronological order; and form a matrix $A_{n\times m}$ with n rows and m columns.

(2) Characterize a system's running status.
1) Standardize the data set D_i = {x_1i, x_2i, ..., x_ni} in the ith column of $A_{n\times m}$, so that the mean of D_i is zero and the deviation of D_i is one.
2) Calculate the covariance matrix of $A_{n\times m}$:

$$\Sigma_A=\begin{pmatrix}\sigma_{11}^{2} & \cdots & \sigma_{1m}^{2}\\ \vdots & \ddots & \vdots\\ \sigma_{m1}^{2} & \cdots & \sigma_{mm}^{2}\end{pmatrix},   (2)$$

where $\sigma_{ij}^{2}$ is the covariance between D_i and D_j:

$$\sigma_{ij}^{2}=\frac{1}{n}\sum_{k=1}^{n}z_{ki}z_{kj}.   (3)$$

3) Calculate the eigenvector u and use it as the direction of the data set.

(3) Update the running status.
1) When a new monitoring data instance x_t comes, to magnify the influence of outliers on the change of the main direction, we duplicate the sample n×r times, where r is the ratio between the number of the current sample's copies and the size of the current samples. Then we get the updated matrix:

$$A' = A \cup \{x_t, x_t, \ldots, x_t\}.   (4)$$

2) Update the mean and the covariance of the matrix:

$$\mu'=\frac{\mu+r x_t}{1+r},\qquad \Sigma_{A'}=\frac{Q}{1+r}+\frac{r}{1+r}x_t x_t^{T}-\mu'\mu'^{T},\qquad Q=\frac{A^{T}A}{n}.   (5)$$

3) Update the direction of the eigenvector to:

$$u_t=\frac{\Sigma_{A'}u}{\lVert\Sigma_{A'}u\rVert}.   (6)$$

(4) Estimate the anomaly degree. We use the cosine similarity to represent the deviation between the two eigenvectors' directions, and estimate the anomaly degree of the newly collected monitoring data. The smaller the cosine similarity is, the bigger the deviation is, which means a more serious anomaly. Here the anomaly degree is defined as:

$$AnoSig_t=1-\frac{\langle u_t,u\rangle}{\lVert u_t\rVert\,\lVert u\rVert}.   (7)$$
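To illustrate Eqs. (2)–(7), the sketch below maintains the principal direction incrementally and converts its deviation from the historical (baseline) direction into an anomaly degree via cosine similarity. It is a simplified sketch under our reading of the update formulas above (a single tracked direction, one power-iteration step per new instance, and the baseline taken as the direction of the initial window); the class and variable names are ours, not the authors' code.

```python
import numpy as np

class DirectionTracker:
    """Tracks the principal direction of standardized monitoring data (Eqs. (2)-(7))."""

    def __init__(self, window):
        # window: n x m matrix of standardized baseline samples (zero mean, unit variance per column)
        self.n = window.shape[0]
        self.mu = window.mean(axis=0)                 # approximately zero after standardization
        self.Q = window.T @ window / self.n           # covariance of the window (the Q of Eq. (5))
        _, eigvecs = np.linalg.eigh(self.Q)
        self.u0 = eigvecs[:, -1]                      # baseline principal direction (largest eigenvalue)
        self.u = self.u0.copy()

    def update(self, x, r=0.1):
        # Eq. (4): absorb a new sample x with relative weight r to magnify the influence of outliers.
        mu = (self.mu + r * x) / (1.0 + r)                                              # Eq. (5): mean
        cov = self.Q / (1.0 + r) + (r / (1.0 + r)) * np.outer(x, x) - np.outer(mu, mu)  # Eq. (5): covariance
        self.mu = mu
        u = cov @ self.u                                                                # Eq. (6): power-iteration step
        self.u = u / np.linalg.norm(u)

    def anomaly_degree(self):
        # Eq. (7): one minus the cosine similarity between the updated and historical directions.
        cos = (self.u @ self.u0) / (np.linalg.norm(self.u) * np.linalg.norm(self.u0))
        return 1.0 - cos

# Usage sketch: feed each newly collected, standardized instance and read the degree.
# tracker = DirectionTracker(baseline_window)   # e.g., a 20 x 7 window of standardized samples
# tracker.update(new_instance)
# print(tracker.anomaly_degree())
```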

Dynamic workloads always fluctuate in cloud computing, which is reflected in the concurrency number and the access pattern [9]. For concurrency, we use PCA to describe the correlations between metrics and detect anomalies by analyzing the change of correlations. In this way, the change of concurrency does not lead to a change in correlations, and thus does not affect the accuracy of anomaly detection. For the change of access patterns, our previous work has presented an algorithm for recognizing workload patterns with online incremental clustering, which learns and matches access patterns during the running process [10]. Since the change of access patterns causes changes in metric correlations, and thus changes in the selected key metrics [11], we use this approach to identify access patterns in this paper. Meanwhile, since workloads are often stable and periodic, Algorithm 2 first identifies the access pattern in the current period, monitors the corresponding key metrics of the specific access pattern, and then analyzes them to detect anomalies.

Note that we use PCA to calculate the eigenvector of the collected monitoring data instances. The obtained eigenvector represents the correlations between the metrics of the monitoring data instances. We detect a fault when the correlations between metrics change. Thus, we can detect a fault by detecting the change of the direction of the eigenvector. The cosine similarity measures the included angle between two eigenvectors' directions to quantify the change of correlations between metrics in the range [0, 1]. In contrast, other similarity measures (e.g., Euclidean distance, Manhattan distance, Chebyshev distance) calculate the distance between two vectors, but cannot capture the change of a direction. Regarding computational complexity, we use PCA to calculate the direction of the eigenvector in the initialization period. The computational complexity of a PCA algorithm is O(n³) [12], where n is the number of metrics. In the online analysis period, we only update the direction with that of the current eigenvector, and thus the computational complexity is O(n²). Since the number of metrics is often about ten after the selection of monitored metrics, the computational complexity is acceptable and suitable for online analysis.

3.3 Dynamic adjustment for monitoring periods
Since a system's running status keeps changing, some unexpected runtime factors (e.g., resource competition in a multithreaded program) may trigger random faults. The fault occurrence, which is a random event related to the running environment, usually conforms to a Poisson process [13]. Therefore, we use an exponential distribution to model reliability and then predict the time of fault occurrence. We classify faults into two categories: transient faults and cumulative faults [14]. The former are usually triggered by accidental events. For instance, some spin-lock faults are caused by the competition for shared resources between multiple threads/processes, which results in circular wait or deadlock and manifests as a dramatic rise in CPU utilization. The latter often result from the accumulation of small errors and exceptions. Take a memory leak fault as an example: a thread creates an object occupying some memory in an invocation but fails to release it, so the total occupied memory increases gradually until there is not enough memory for some programs to run.

The Poisson process is a classic model for software reliability. It predicts the next possible time of faults according to historical faults [13]. Thus, we adopt this idea by replacing historical fault events with the anomaly degree to predict the time of the next fault. In this way, when a transient fault happens, we can react instantly by minimizing the time interval between two adjacent monitoring actions. For cumulative faults, we can gradually adjust the time interval according to the anomaly degree. In a Poisson process, a random variable N represents the number of faults occurring in x seconds. If the frequency of faults is λ times per second, N follows a Poisson distribution with mean λx:

$$P(X>x)=P(N=0)=e^{-\lambda x},\quad x\ge 0.   (8)$$

The cumulative distribution function of X is:

$$F(x)=P(X\le x)=1-e^{-\lambda x},\quad x\ge 0,   (9)$$

where X is an exponential random variable with parameter λ, which represents the time interval between consecutive faults in a Poisson process, and λ is the average number of faults per unit time. Note that the probability of a certain number of faults in a specific time interval is only related to the length of the time interval. Therefore, the selection of the starting time is irrelevant to the estimated time of faults. We define the probability of a fault as F(t) = w, and then we can estimate the time interval between the current and the next fault: t = -ln(1-w)/λ. So, we adjust the monitoring period as:

$$T=\begin{cases}T_\alpha, & 0\le s_t\le\alpha\\ \ln(1-s_t)/\lambda+e, & \alpha<s_t<\beta\\ T_\beta, & \beta\le s_t\le 1,\end{cases}   (10)$$

where T_β is the minimum monitoring period, T_α is the maximum monitoring period, s_t is the system's anomaly degree at time t, and λ and e are adjustment parameters. We discuss these parameters in detail as follows:
1) T_β influences the maximum monitoring overhead. Its value should be set according to the target system and the administrator's experience. For example, if the request density is about one request every two seconds and the monitoring period is set to one second, we will get too many invalid "null" results.
2) T_α determines the timeliness of anomaly detection. For example, if α is set to 60%, faults would happen with a 60% chance between two monitoring actions, i.e., faults may have been triggered before we detect anomalies.
3) If s_t = β, then T = T_β; and if s_t = α, then T = T_α. Thus, we only need to determine the values of T_α and T_β, and can then derive the values of the parameters λ and e.

By analyzing the above function, we can see that the higher the anomaly degree is, the shorter the monitoring period is. Moreover, as the anomaly degree gets higher, the decline in the monitoring period gets faster. That is, the more serious the anomaly is, the faster the monitoring period drops, which is exactly what we expect. Note that we quantify the anomaly degree instead of simply judging the existence of faults. In most situations, the anomaly degree changes gradually, and therefore a changing monitoring period can reflect the trend of an anomaly. Thus, we handle cumulative faults well. Meanwhile, as for transient faults, if unexpected exceptions happen during the dynamic adjustment of the monitoring period, the maximum threshold can guarantee the accuracy and timeliness of anomaly detection.
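The sketch below computes the monitoring period from the anomaly degree following the piecewise form of Eq. (10) as reconstructed above, with λ and e derived from the two boundary conditions T(α) = T_α and T(β) = T_β. The threshold values in the usage comment are illustrative assumptions, not values from the paper.

```python
import math

def fit_adjustment_params(t_alpha, t_beta, alpha, beta):
    """Solve lambda and e from T(alpha) = T_alpha and T(beta) = T_beta (Eq. (10))."""
    lam = (math.log(1 - alpha) - math.log(1 - beta)) / (t_alpha - t_beta)
    e = t_alpha - math.log(1 - alpha) / lam
    return lam, e

def monitoring_period(s_t, t_alpha, t_beta, alpha, beta, lam, e):
    """Eq. (10): map the anomaly degree s_t in [0, 1] to a monitoring period in seconds."""
    if s_t <= alpha:
        return t_alpha                       # low anomaly degree: longest (cheapest) period
    if s_t >= beta:
        return t_beta                        # high anomaly degree: shortest period
    return math.log(1 - s_t) / lam + e       # middle range: drops faster as s_t grows

# Usage sketch with the case-study settings (T_alpha = 300 s, T_beta = 10 s)
# and illustrative thresholds alpha = 0.1, beta = 0.6 (assumed values):
# lam, e = fit_adjustment_params(300, 10, 0.1, 0.6)
# print(monitoring_period(0.3, 300, 10, 0.1, 0.6, lam, e))
```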

TABLE 4 EXPERIMENT ENVIRONMENT
Host No. | Hardware | Software
1    | Intel(R) Xeon(R) CPU E5645, 2.40 GHz, 24 cores, 24 GB RAM | OS: CentOS 7.1; Load Generator: Bench4Q 1.0; Load Balancer: Apache 2.4.18
2, 3 | Intel(R) Core(TM) i5-3470, 3.20 GHz, 24 cores, 16 GB RAM  | OS: CentOS 7.1; Application Server: Tomcat 9.0; Web Application: Bench4Q 1.0
4    | Intel(R) Core(TM) i5-3470, 3.20 GHz, 24 cores, 16 GB RAM  | OS: CentOS 7.1; Database: MySQL 5.6

4 EVALUATION
4.1 Experimental environment
We use our cloud computing platform OnceCloud, which is similar to Amazon EC2, and deploy our open-source multi-layer e-commerce benchmark Bench4Q1 to construct an experimental environment. We apply our approach in this environment, and inject random faults into the target system to validate its effectiveness. Fig. 2 shows that the environment contains four blade servers deployed with docker containers running the applications. The configurations of each component are listed in Table 4.

TABLE 5 KEY MONITORING METRICS OF DOCKER
Resource type | Metric's index value | Description
CPU     | system_cpu_usage | CPU time occupied by system calls
Network | rx_packets       | Number of packets received
Network | tx_packets       | Number of packets sent
Memory  | used             | Occupied memory of the container
Memory  | pgpgin           | Number of bytes transferred from disk or swap to memory
Disk    | read             | Number of bytes read
Disk    | write            | Number of bytes written

[Figure: three load generators, a load balancer (LB), five application servers (AS 1–AS 5) and two databases (DB 1, DB 2) deployed as containers across Hosts 1–4.]
Fig. 2 Experiment Environment

1 Bench4Q. http://www.ow2.org/view/ActivitiesDashboard/Bench4Q

Bench4Q is a benchmark complying with the TPC-W specification2, a well-known benchmark specification for enterprise applications modeling online e-commerce applications. TPC-W has been widely used by industrial corporations (e.g., IBM, Oracle) and academic communities to study distributed systems. Like many existing works [15][16][17] that study distributed systems with TPC-W, we also choose it for the evaluation. Due to the lack of an open-source, complex and well-known production cloud application, we have to select such a typical industry-standard e-commerce benchmark. Furthermore, although we cannot construct a large-scale application on a real cloud computing platform, our typical distributed application Bench4Q deployed on our cloud computing platform OnceCloud, with 11 docker containers on 4 hosts, can well represent a cloud computing environment. TPC-W defines three kinds of workload patterns: browsing, searching and purchasing. Simulated customers visit websites with a series of operations in sessions. Bench4Q adopts the classic three-layer framework of enterprise applications, which includes three widely used open-source software systems: a load balancer, Apache3, for distributing requests; a cluster of application servers, Tomcat4, for processing business logic; and a database server, MySQL5, for operating data. To simulate dynamic workloads in cloud computing, we extend Bench4Q to support the dynamic change of the original workloads in the concurrency number and access patterns at runtime.

4.2 Selection of key metrics

For collecting monitoring data, we use Zabbix6, a distributed monitoring system, to collect 26 physical metrics in four categories: processor, memory, disk and network. Then, as shown in Table 5, we use correlation analysis to select 7 key metrics to represent the others, and conduct online estimation of the anomaly degree based on them. Meanwhile, we have made some modifications to Zabbix to support the dynamic adjustment of monitoring periods.

2 TPC-W. http://www.tpc.org/tpcw
3 Apache. http://httpd.apache.org
4 Tomcat. http://tomcat.apache.org
5 MySQL. http://www.mysql.com
6 Zabbix. http://www.zabbix.com

TABLE 6 NETWORK RELATED METRICS
Metric's index | Description
rx_packets | Number of received packets
tx_packets | Number of sent packets
rx_bytes   | Number of received bytes
tx_bytes   | Number of sent bytes
rx_errors  | Number of errors in receiving packets
tx_errors  | Number of errors in sending packets
rx_dropped | Number of dropped packets in receiving
tx_dropped | Number of dropped packets in sending

[Figure: the correlation graph of the eight network-related metrics (rx_packets, tx_packets, rx_bytes, tx_bytes, rx_errors, tx_errors, rx_dropped, tx_dropped).]
Fig. 3 Correlations between metrics related with network

For selecting key metrics, due to space limitations, we take the network-related metrics as an example. Table 6 shows the eight network-related metrics we collect, and Fig. 3 shows the calculated correlations between these metrics. Then, we use Algorithm 1 to select two key metrics (i.e., rx_packets and tx_errors). We use a sliding window to cache monitoring data, and then implement the following two functionalities with the data. First, we calculate the eigenvector of the current metric vectors, and compare it with a standard eigenvector to detect possible faults. Second, we recognize the current workload pattern based on the current workload vectors in the sliding window. Note that the longer the sliding window is, the less apparent faults are, and the worse the timeliness of anomaly detection is, which leads to more unalerted faults and undetected anomalies. On the contrary, the shorter the sliding window is, the more apparent faults are, but noises in the monitoring data also become significant and influence the final result, causing more false alarms. To set a suitable length for the sliding window, we have conducted a series of experiments. Due to space limitations, we do not provide the experimental data in detail. The experimental results show that the approach works best when the length is between 15 and 25, so we set the length to 20.

4.3 Adaptability of dynamic workloads
A dynamic workload can be characterized by concurrency and access pattern [9]. We simulate changing workloads with the load generator of Bench4Q to validate that the approach can adapt to dynamic workloads in cloud computing. In the experiment, the concurrency number gradually increases from zero to 300 in steps of two. The whole experiment lasts for 750 minutes, each concurrency level lasts for 5 minutes, and we thus collect 150 monitoring data instances during the running period. We set the length of the sliding window to 20, use the first 20 data instances as the baseline data set, and estimate the anomaly degree online with the following data instances.

The approach dynamically adjusts the monitoring period based on the estimated anomaly degree. In the experiment, to facilitate evaluating adaptability, we set the threshold of the anomaly degree to an empirical value of 0.2. Any estimated value higher than this threshold is judged as an anomaly. Note that if the threshold is too high, no warning will be triggered when the anomaly degree rises, and thus the timeliness of anomaly detection will be jeopardized. On the contrary, if the threshold is too low, there will be warnings even though the system works well, therefore bringing false alarms. In production environments, the threshold is often set according to administrators' experience. For example, administrators often set the threshold of CPU utilization to 95%. This threshold is only responsible for the warning of anomalies, and has nothing to do with the adjustment of monitoring periods.
(1) Scenario 1: the concurrency number is changing, while the access pattern remains stable. We set the access pattern to "Browsing", and Fig. 4 (a) shows the experimental result. In the figure, the horizontal axis represents the concurrency number and the vertical axis shows the anomaly degree ranging from 0 to 1. When the access pattern remains stable and the concurrency number changes, all of the normal monitoring instances are correctly judged as normal. In this situation, the approach adapts well to the dynamic change of the concurrency number.
(2) Scenario 2: the access pattern and the concurrency number are both changing. The access pattern changes in the order of "Browsing", "Shopping" and "Ordering". The concurrency number increases by 50 each time the access pattern changes (i.e., the concurrency numbers in the "Browsing" pattern are 1-50 and 151-200, those in "Shopping" are 51-100 and 201-250, and those in "Ordering" are 101-150 and 251-300). Fig. 4 (b) shows the experimental result: 15 data instances are misjudged as abnormal, which indicates an error rate of 5%. Therefore, the approach is robust and can adapt well to cloud computing with ever-changing workloads.

Fig. 4 Anomaly Significance Variation with Workload

TABLE 7 INJECTED FAULTS
Resource type | Fault type | Fault description | Injection operation
CPU     | CPU Hog            | Endless loops or circular wait of processes in applications (e.g., spin lock) | Injected faults in the application take additional computation operations occupying CPU cycles.
Network | Network Congestion | The applications injected with malicious codes send datagrams, or accept a large amount of requests in network attacks. | Using the tc command causes the network to randomly drop 20%, 40%, and 60% of the packets.
Memory  | Memory Leak        | Allocating heap memory without releasing objects, which gradually exhausts the system memory and eventually leads to a system crash. | Injected faults in the application create an object with 1k bytes and refer it to a static variable when the faults are triggered.
Disk    | IO Interference    | Multiple applications simultaneously read/write a shared disk intensively. | A co-located Docker runs disk-writing operations with StressLinux.

4.4 Accuracy of anomaly detection

The required initialization parameters for PCA are as follows: the number of collected monitoring data instances (i.e., the number of rows in the matrix) is 30, the number of collected metrics in a monitoring data instance (i.e., the number of columns in the matrix) is 26, and the dimension of the calculated eigenvector is 7, as analyzed in subsection 4.2. As shown in Table 7, we choose four typical faults [18] related to CPU, disk, network and memory, and inject them with widely used methods [15, 19, 20]. These faults are triggered when the injected methods are called at runtime. For the CPU and disk related faults, we use StressLinux7, a tool for stress testing and system debugging. For the network related faults, we use iPerf8, a network performance tester, to occupy network bandwidth with TCP and UDP packets. In this experiment, the concurrency number starts at 0 and increases incrementally by 2 each time until it reaches 240. The access pattern changes randomly among "Browsing", "Shopping" and "Ordering". The experiment lasts for 600 minutes, each concurrency number lasts for 5 minutes, and we collect 120 data instances during the period. The length of the sliding window is set to 20, the first 20 data instances are used as a baseline to calculate the main direction of the PCA eigenvector, and the following data instances are used for online anomaly detection. We inject one type of fault in each experiment when the concurrency number is 60. As shown in Fig. 5, there is a salient rise in the anomaly degree as faults are injected into the system. Therefore, the approach is able to detect system anomalies caused by faults and can reveal the anomaly severity.

7 StressLinux. http://www.stresslinux.org/sl
8 iPerf. https://iperf.fr/

Fig. 5 Fault Detection Effectiveness

4.5 Timeliness of anomaly detection
The timeliness of anomaly detection is measured by the difference between the time point when the monitoring system detects a fault and the time point when the fault is triggered. We conduct experiments with the typical faults listed in Table 7 to validate the timeliness. Each experiment with a fault lasts 120 minutes. In the first stage, with a fixed monitoring cycle, the fault is triggered in the 30th minute and stopped in the 55th minute; in the second stage, with our dynamic monitoring cycle, the fault is triggered in the 90th minute and stopped in the 115th minute. Fig. 6 shows the comparison results, where the x-axis represents the injected fault numbers and the y-axis represents the delay in detecting faults. The experimental results demonstrate that, compared to the method with a fixed monitoring cycle, our method of dynamically adjusting the monitoring cycle reduces the alarm delay by 39.1%.

Fig. 6 Comparison of alarm delay

4.6 Case study: memory leak
In this section, we take the memory leak fault in Table 7 as an example to evaluate the effectiveness, timeliness and overhead of the approach compared with approaches using fixed monitoring periods. In the experiment, we set the maximum monitoring period to 300 seconds and the minimum to 10 seconds. The initial anomaly degree is 0, and we start to monitor with the maximum monitoring period (i.e., 300 s). We inject the fault at the 30th monitoring data instance. Fig. 7 (a) shows the variation of the estimated anomaly degree, and Fig. 7 (b) shows the dynamic adjustment of the monitoring period.

(1) Effectiveness of anomaly detection. Fig. 7 (a) shows that, after the fault is injected into the system at the 30th monitoring point, the anomaly degree rises as the memory leak becomes more and more severe, and reaches the alarm threshold at the 49th monitoring data instance. Note that we define the alarm rule as 10 consecutive points above the threshold. Therefore, we can effectively detect the memory leak fault injected into the system.
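The alarm rule used here (an alarm is raised only after 10 consecutive estimates above the threshold) can be written as a small helper. This is a minimal sketch; the 0.2 threshold is carried over from Section 4.3 as an assumption.

```python
def should_alarm(degrees, threshold=0.2, consecutive=10):
    """Return True once the last `consecutive` anomaly degrees all exceed the threshold."""
    if len(degrees) < consecutive:
        return False
    return all(d > threshold for d in degrees[-consecutive:])

# Usage sketch: append each newly estimated AnoSig_t and check the rule.
# history.append(ano_sig_t)
# if should_alarm(history): print("memory-leak alarm raised")
```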

(2) Timeliness of anomaly detection. Fig. 7 (b) shows that, after the fault is injected, the anomaly degree keeps increasing while the monitoring period keeps decreasing. There are 1959 seconds between fault injection and anomaly detection (i.e., between the 31st and the 49th monitoring points). For traditional monitoring approaches with a fixed monitoring period, if the monitoring period is set to 150 seconds (i.e., half of the maximum monitoring period), it takes 2850 seconds to collect the same amount of monitoring data instances. Therefore, the detection time of our approach is 68.7% of that of traditional approaches when a system is in a faulty status.

(3) Comparison of monitoring overhead. Fig. 7 (b) shows that, in a normal status, the monitoring period of our approach remains at the maximum value, which means that the number of collected monitoring data instances is half of that of traditional approaches. Thus, when a system is running in a normal status, the monitoring overhead of our approach is 50% of that of traditional approaches. Furthermore, from the time that the fault is injected (i.e., the 31st monitoring point) to the time that the fault is detected (i.e., the 49th monitoring point), there are 19 monitoring points (i.e., 1959 seconds). For traditional approaches with a fixed monitoring period, there would be 14 monitoring points. Thus, the monitoring overhead of our approach is 135.71% of that of traditional approaches. In summary, the monitoring overhead of our approach is 50% of that of traditional approaches in a normal status, and is 35.71% higher than that of traditional approaches in a faulty status. Meanwhile, the timeliness of anomaly detection, as analyzed above, improves by 31.3% at the expense of this monitoring overhead.

Fig. 7 Monitoring Comparison

5 RELATED WORK
We propose a self-adaptive monitoring approach which dynamically adjusts monitored metrics and monitoring periods according to the running status of cloud computing systems, so as to reduce the monitoring overhead in a normal status and improve the timeliness of anomaly detection in an abnormal status. We focus on the selection of monitored metrics and the adjustment of monitoring periods. For the former, we select key metrics that can represent the running status by analyzing the correlations between metrics from historical monitoring data instances. For the latter, we dynamically adjust monitoring periods based on the anomaly degree, which involves the estimation of the anomaly degree and the strategies for adjusting monitoring periods. In summary, the related work of this paper includes: monitoring frameworks for cloud computing systems, the selection of monitored metrics, the estimation of anomaly degree, and the adjustment of monitoring periods. In this section, we introduce and compare related work in these four aspects.

5.1 Monitoring framework for cloud computing systems
Currently, most monitoring frameworks for cloud computing systems use a master-slave structure, where monitoring data instances are collected by agents deployed on every monitored host or virtual machine, and are transferred to and stored on a central server. These frameworks include open-source tools (e.g., Zabbix, Nagios9, Ganglia10) and commercial products (e.g., IBM Tivoli11, HP OpenView12, LiveAction13). However, collecting and transferring large-scale monitoring data instances inevitably brings a monitoring overhead, which puts huge pressure on the central server. Hence, recent studies introduce distributed frameworks. Andreolini et al. [21] present a multi-layer monitoring framework based on distributed monitors, which has the scalability to deal with the online analysis of large-scale systems. Huang et al. [22] analyze the efficiency difference between "push" and "pull" strategies for acquiring monitoring data instances in different scenarios, and propose a requirement-driven monitoring approach based on "push" or "pull" models to lower the monitoring overhead. Koning et al. [23] propose an extensible distributed monitoring system with a peer-to-peer architecture, which can dynamically add or delete a monitored data source to adapt to an ever-changing cloud environment. Some studies focus on specific kinds of resources and requirements in cloud computing. VMDriver [24] collects monitored metrics through the hypervisor independently of the type and version of the targeted VMs, so it can monitor typical kinds of guest OSes. Han et al. [25] use a REST strategy and model various kinds of physical resources (e.g., computing, network, storage) in tree structures, so that users can conveniently get their expected monitoring data. Shao et al. [26] characterize monitoring data instances with different models, so as to provide differentiated views for different users according to their focuses on specific resources. mOSAIC [27] provides customized monitoring interfaces for different kinds of cloud applications. MonPaaS [28] differentiates service providers and customers, allowing service providers to check the whole cloud framework and customers to define and monitor their own metrics for the resources rented from the cloud platform. Our approach selects monitored key metrics, estimates the anomaly degree, and adjusts monitoring periods, so it can be applied to these existing frameworks to lower the monitoring overhead and improve the timeliness of anomaly detection.

9 Nagios. http://www.nagios.org/
10 Ganglia. http://ganglia.info/
11 Tivoli. http://www-03.ibm.com/software/products/en/tivosystautoformult/
12 OpenView. http://www8.hp.com/us/en/software-solutions/operations-manager-infrastructure-monitoring/
13 LiveAction. http://liveaction.com/

5.2 Selection of monitored metrics
Monitoring systems always set constraint conditions for monitored metrics and detect anomalies through the violations of these constraints. Such systems collect monitoring data instances from each node and then conduct unified processing (e.g., calculating a SUM value). To lower the overhead of transferring monitoring data instances, some studies break down a constraint into smaller constraints and assign them to different nodes. In this way, monitoring systems can estimate the possibility of certain events in a specific period, and adjust the partial constraint conditions on the slave nodes rather than on a central server [29-31]. For monitoring networks, some studies collect random samples of monitoring data instances and check whether the samples violate predefined conditions [32]. However, since the definition of constraint conditions requires detailed knowledge and experience, it is hard for administrators to set suitable constraints for most situations. Unlike existing approaches, our approach automatically analyzes the correlations between metrics according to historical data instances, and then monitors only the key metrics that can reflect the running status of target systems without manual intervention. Thus, administrators do not have to analyze specific target systems and define monitored metrics and conditions with their domain knowledge.

5.3 Estimation of anomaly degree
Some studies analyze logs and monitoring data to estimate the anomaly degree. LogMaster [33] detects anomalies by correlating events from logs and then discovering the violations of these correlations. Guan et al. [34] reduce the dimension of monitoring data with Bayesian networks, and then use information entropy and Euclidean distance to detect nodes whose behaviors are inconsistent with others. Meanwhile, some studies introduce time series models to predict faults by analyzing the variation trend of some metrics, so as to avoid faults in advance. Salfner et al. [35] analyze the trend of monitoring data and predict the resource utilization accordingly with the Wilcoxon rank sum. Oliner et al. [36] use SIGs (Structure-of-Influence Graphs) to infer the influences between interacting components. PCA, with its low computation overhead, is suitable for online analysis, so it has been widely used in online anomaly detection. Lan et al. [37] extract features with PCA and ICA to reduce the data dimension, and then use outlier detection to detect abnormal nodes in clusters. This method uses PCA to reduce the dimension and thus decrease the amount of analyzed monitoring data, while we use PCA to detect an abnormal system status. Xu et al. [38] analyze logs to extract feature vectors, and then use PCA to detect the change of correlations between vector dimensions. This method uses PCA to analyze static logs, while we extend PCA and use the eigenvector to characterize a system's status online. Existing approaches usually employ complicated statistical models with high time complexity, which are usually not suitable for online monitoring and analysis. Meanwhile, these approaches heavily rely on parameters which vary with the deployment environment, and improper parameters would affect the accuracy of the models for anomaly detection. We use an incremental PCA to update eigenvectors based on recent monitoring data instances instead of the whole data set, and then estimate the anomaly degree based on the cosine similarity instead of a distance. In this way, our approach, with low computational complexity, performs better in the online estimation of anomaly degree.

5.4 Adjustment of monitoring periods
There are a few recent studies related to the dynamic adjustment of monitoring periods. MaaS [39], which is the most relevant one, adjusts monitoring periods based on the fluctuation of monitored metrics: the bigger the fluctuation is, the higher the monitoring frequency is. Meanwhile, its adjustment of monitoring periods learns from the "slow start, quick drop" strategy of TCP congestion control. This approach mainly has two problems. First, the fluctuation of metrics is usually caused by the change of workloads (e.g., the number of concurrent users), so a fluctuation does not necessarily indicate an anomaly. Second, the adjustment of monitoring periods with TCP congestion control cannot adapt to the change of the anomaly degree. We calculate the correlations between all kinds of metrics with PCA to estimate the anomaly degree, so our approach can adapt to ever-changing workloads. Meanwhile, we adjust monitoring periods based on the estimation of the anomaly degree and the fault prediction model, so the adjustment of monitoring periods can adapt to the running status.

Some important recent works focus on privacy and attacks. Gai et al. [40] concentrate on privacy and propose a novel data encryption approach called Dynamic Data Encryption Strategy (D2ES) to selectively encrypt data and use privacy classification methods under timing constraints. D2ES maximizes the privacy protection scope by using a selective encryption strategy within the required execution time requirements. Gai et al. [41] propose an attack strategy, the maximum attacking strategy using spoofing and jamming (MASS-SJ), which utilizes an optimal power distribution to maximize the adversarial effects; spoofing and jamming attacks are launched in a dynamic manner in order to interfere with the maximum number of signal channels. Gai et al. [42] propose a Security-Aware Efficient Data Sharing and Transferring (SA-EAST) model designed for securing cloud-based ITS implementations, so as to obtain secure real-time multimedia data sharing and transferring. Luo et al. [43] propose a prediction-based data sensing and fusion scheme to reduce the data transmission and maintain the required coverage level of sensors in WSNs while guaranteeing data confidentiality, which provides high prediction accuracy, low communication, good scalability and confidentiality. Luo et al. [44] propose an algorithm that incorporates an approximate dynamic programming-based online parameter tuning strategy into the QoS prediction approach; through online learning and optimization, this approach provides QoS prediction with automatic parameter tuning capability to achieve near-optimal performance. These approaches are excellent for efficient and effective real-time data sharing and transferring, while we focus on efficient monitoring for distributed systems in cloud computing.

6 CONCLUSION
Monitoring is one of the most important fundamental services of cloud computing platforms to guarantee the reliability of cloud computing systems. However, frequently collecting, transferring, storing and analyzing a huge amount of monitoring data brings significant overhead and costs. On the contrary, if the time interval between monitoring actions is too long, the timeliness of anomaly detection will be jeopardized, and less monitoring data will be collected for fault analysis and location. To address the above issue, this paper proposes a self-adaptive monitoring approach with anomaly detection. There are two important aspects in cloud monitoring: monitored metrics and monitoring periods (the time interval between two monitoring actions). Accordingly, we select monitored metrics by evaluating the importance of key metrics, and adjust monitoring periods based on predicting the possibility of fault occurrence. For the selection of monitored metrics, we calculate the correlations between metrics to construct metric correlation graphs and further select the key metrics that can reflect systems' running statuses. For the adjustment of monitoring periods, we calculate the deviation of eigenvectors using PCA to estimate the anomaly degree, and then adjust monitoring periods with a reliability model. In this way, we can automatically adjust the monitoring period to improve the accuracy and timeliness of anomaly detection in an abnormal status and lower the monitoring overhead in a normal status. This approach allows systems to lower the monitoring overhead when the possibility of fault occurrence is low, and to ensure the timeliness of anomaly detection when faults are more likely to happen.

ACKNOWLEDGMENT This work was supported in part by National Natural Science Foundation of China under Project 61402450 and 61602454, National Key Technology R&D Program of China under Project 2015BAH55F02, CCF-Venustech Hongyan Research Initiative (CCF-VenustechRP2016007), Alipay Ant Research Initiative (XZ502017000730).

REFERENCES
[1] D. A. Patterson, "A simple way to estimate the cost of downtime," in Proc. 16th USENIX Conference on System Administration, Philadelphia, PA, USA, 2002, pp. 185-188.
[2] S. Iqbal et al., "On cloud security attacks: A taxonomy and intrusion detection and prevention as a service," Journal of Network and Computer Applications, vol. 74, pp. 98-120, 2016.
[3] V. Prokhorenko, K.-K. R. Choo, and H. Ashman, "Web application protection techniques: A taxonomy," Journal of Network and Computer Applications, vol. 60, pp. 95-112, 2016.
[4] Amazon, "EC2 CloudWatch," 2012. [Online]. Available: http://aws.amazon.com/cloudwatch/
[5] M. C. Huebscher and J. A. McCann, "A survey of autonomic computing—degrees, models, and applications," ACM Computing Surveys, vol. 40, no. 3, pp. 1-28, 2008.
[6] G. Jiang, H. Chen, and K. Yoshihira, "Modeling and tracking of transaction flow dynamics for fault detection in complex systems," IEEE Transactions on Dependable and Secure Computing, vol. 3, no. 4, pp. 312-326, 2006.
[7] J. D. Evans, Straightforward Statistics for the Behavioral Sciences. Pacific Grove, CA, USA: Brooks/Cole, 2005.
[8] M. A. Munawar and P. A. S. Ward, "A comparative study of pairwise regression techniques for problem determination," in Proc. Conference of the Center for Advanced Studies on Collaborative Research, Toronto, Canada, 2007, pp. 152-166.
[9] T. Wang, J. Wei, W. Zhang, H. Zhong, and T. Huang, "Workload-aware anomaly detection for Web applications," Journal of Systems and Software, vol. 89, pp. 19-32, 2014.
[10] T. Wang, W. Zhang, J. Wei, H. Zhong, and T. Huang, "FD4C: Automatic fault diagnosis framework for Web applications in cloud computing," IEEE Transactions on Systems, Man, and Cybernetics: Systems.
[11] A. Williams, M. Arlitt, C. Williamson, and K. Barker, "Web workload characterization: Ten years later," Web Content Delivery, vol. 2, no. 1, pp. 3-21, 2005.
[12] H. Sun, C. W. Sang, and C. G. Liu, "Sparse principal components analysis," Computer Science, 2015.
[13] K. Shibata, K. Rinsaka, and T. Dohi, "Metrics-based software reliability models using non-homogeneous Poisson processes," in Proc. 17th International Symposium on Software Reliability Engineering, 2006, pp. 52-61.
[14] F. Salfner, M. Lenk, and M. Malek, "A survey of online failure prediction methods," ACM Computing Surveys, vol. 42, no. 3, pp. 1-42, 2010.
[15] G. Casale, N. Mi, and E. Smirni, "Model-driven system capacity planning under workload burstiness," IEEE Transactions on Computers, vol. 59, no. 1, pp. 66-80, 2010.
[16] W. Zhou, G. Pierre, and C.-H. Chi, "CloudTPS: Scalable transactions for Web applications in the cloud," IEEE Transactions on Services Computing, vol. 5, no. 4, pp. 525-539, 2012.
[17] N. Janevski and K. Goseva-Popstojanova, "Session reliability of Web systems under heavy-tailed workloads: An approach based on design and analysis of experiments," IEEE Transactions on Software Engineering, vol. PP, no. 99, pp. 1-22, 2013.
[18] S. Pertet and P. Narasimhan, "Causes of failure in web applications," Parallel Data Laboratory, Carnegie Mellon University, Tech. Rep. CMU-PDL-05-109, 2005.
[19] H. Kang, H. Chen, and G. Jiang, "PeerWatch: A fault detection and diagnosis tool for virtualized consolidation systems," in Proc. 7th International Conference on Autonomic Computing, 2010, pp. 119-128.
[20] G. Jiang, H. Chen, K. Yoshihira, and A. Saxena, "Ranking the importance of alerts for problem determination in large computer systems," Cluster Computing, vol. 14, no. 3, pp. 213-227, 2011.
[21] M. Andreolini, M. Colajanni, and M. Pietri, "A scalable architecture for real-time monitoring of large information systems," in Proc. 2nd Symposium on Network Cloud Computing and Applications (NCCA), 2012, pp. 143-150.
[22] H. Huang and L. Wang, "P&P: A combined push-pull model for resource monitoring in cloud computing environment," in Proc. IEEE 3rd International Conference on Cloud Computing (CLOUD), 2010, pp. 260-267.
[23] B. Konig, J. M. A. Calero, and J. Kirschnick, "Elastic monitoring framework for cloud infrastructures," IET Communications, vol. 6, no. 10, pp. 1306-1315, 2012.
[24] G. Xiang, H. Jin, D. Zou, X. Zhang, S. Wen, and F. Zhao, "VMDriver: A driver-based monitoring mechanism for virtualization," in Proc. 29th IEEE Symposium on Reliable Distributed Systems, 2010, pp. 72-81.
[25] H. Han et al., "A RESTful approach to the management of cloud infrastructure," in Proc. IEEE International Conference on Cloud Computing (CLOUD), 2009, pp. 139-142.
[26] J. Shao, H. Wei, Q. Wang, and H. Mei, "A runtime model based monitoring approach for cloud," in Proc. IEEE 3rd International Conference on Cloud Computing (CLOUD), 2010, pp. 313-320.
[27] M. Rak et al., "Cloud application monitoring: The mOSAIC approach," in Proc. IEEE 3rd International Conference on Cloud Computing Technology and Science (CloudCom), 2011, pp. 758-763.
[28] J. M. A. Calero and J. G. Aguado, "MonPaaS: An adaptive monitoring platform as a service for cloud computing infrastructures and services," IEEE Transactions on Services Computing, vol. 8, no. 1, pp. 65-78, 2015.
[29] S. Kashyap, J. Ramamirtham, R. Rastogi, and P. Shukla, "Efficient constraint monitoring using adaptive thresholds," in Proc. IEEE 24th International Conference on Data Engineering (ICDE), 2008, pp. 526-535.
[30] M. Dilman and D. Raz, "Efficient reactive monitoring," in Proc. IEEE INFOCOM, 2001, vol. 2, pp. 1012-1019.
[31] I. Sharfman, A. Schuster, and D. Keren, "A geometric approach to monitoring threshold functions over distributed data streams," in Proc. ACM SIGMOD International Conference on Management of Data, Chicago, IL, USA, 2006.
[32] C. Estan and G. Varghese, "New directions in traffic measurement and accounting," SIGCOMM Computer Communication Review, vol. 32, no. 1, p. 75, 2002.
[33] X. Fu, R. Ren, J. Zhan, W. Zhou, Z. Jia, and G. Lu, "LogMaster: Mining event correlations in logs of large-scale cluster systems," in Proc. IEEE 31st Symposium on Reliable Distributed Systems (SRDS), 2012, pp. 71-80.
[34] Q. Guan, D. Smith, and S. Fu, "Anomaly detection in large-scale coalition clusters for dependability assurance," in Proc. International Conference on High Performance Computing (HiPC), 2010, pp. 1-10.
[35] F. Salfner, P. Troger, and S. Tschirpke, "Cross-core event monitoring for processor failure prediction," in Proc. International Conference on High Performance Computing & Simulation (HPCS), 2009, pp. 67-73.
[36] A. J. Oliner, A. V. Kulkarni, and A. Aiken, "Using correlated surprise to infer shared influence," in Proc. IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2010, pp. 191-200.
[37] Z. Lan, Z. Zheng, and Y. Li, "Toward automated anomaly identification in large-scale systems," IEEE Transactions on Parallel and Distributed Systems, vol. 21, no. 2, pp. 174-187, 2010.
[38] W. Xu, L. Huang, A. Fox, D. Patterson, and M. I. Jordan, "Detecting large-scale system problems by mining console logs," in Proc. ACM SIGOPS 22nd Symposium on Operating Systems Principles, Big Sky, MT, USA, 2009.
[39] S. Meng and L. Liu, "Enhanced monitoring-as-a-service for effective cloud management," IEEE Transactions on Computers, vol. 62, no. 9, pp. 1705-1720, 2013.
[40] K. Gai, M. Qiu, and H. Zhao, "Privacy-preserving data encryption strategy for big data in mobile cloud computing," IEEE Transactions on Big Data, vol. PP, no. 99, pp. 1-1, 2017.
[41] K. Gai, M. Qiu, Z. Ming, H. Zhao, and L. Qiu, "Spoofing-jamming attack strategy using optimal power distributions in wireless smart grid networks," IEEE Transactions on Smart Grid, vol. 8, no. 5, pp. 2431-2439, 2017.
[42] K. Gai, L. Qiu, M. Chen, H. Zhao, and M. Qiu, "SA-EAST: Security-aware efficient data transmission for ITS in mobile heterogeneous cloud computing," ACM Transactions on Embedded Computing Systems, vol. 16, no. 2, pp. 1-22, 2017.
[43] X. Luo, D. Zhang, L. T. Yang, J. Liu, X. Chang, and H. Ning, "A kernel machine-based secure data sensing and fusion scheme in wireless sensor networks for the cyber-physical systems," Future Generation Computer Systems, vol. 61, pp. 85-96, 2016.
[44] X. Luo, H. Luo, and X. Chang, "Online optimization of collaborative web service QoS prediction based on approximate dynamic programming," International Journal of Distributed Sensor Networks, vol. 2015, Article ID 452492, 2015.

Author Biography

Tao Wang received the PhD degree in Computer Software and Theory from the Institute of Software, Chinese Academy of Sciences, in 2014, and the MS degree in Computer Architecture from the University of Electronic Science and Technology of China in 2008. He is an assistant professor at the Institute of Software, Chinese Academy of Sciences. His research interests include fault diagnosis, software reliability, and autonomic computing for cloud computing systems.

Wenbo Zhang received the PhD degree in Computer Software and Theory from the Institute of Software, Chinese Academy of Sciences, in 2007, and the MS degree in Computer Science from Shandong University in 2002. He is a professor at the Institute of Software, Chinese Academy of Sciences. His research interests include distributed computing, cloud computing, and middleware.

Zeyu Gu received the MS degree in Computer Software and Theory from the Institute of Software, Chinese Academy of Sciences, in 2014. He is a senior software engineer in the SRE (Site Reliability Engineering) group at Baidu Inc., Beijing, China. His research interests include anomaly detection and cloud monitoring.

Hua Zhong received the PhD degree in Computer Software and Theory from the Institute of Software, Chinese Academy of Sciences, in 1999. He is a professor at the Institute of Software, Chinese Academy of Sciences. His research interests include software engineering and distributed computing.

Author Photos: Tao Wang, Wenbo Zhang, Zeyu Gu, Hua Zhong

Highlights
• Correlation analysis is proposed to select key metrics that represent the others.
• PCA is applied to characterize the running status and predict the possibility of faults.
• We dynamically adjust monitored metrics and monitoring periods based on a reliability model.
• We evaluate the approach on our real cloud platform with case studies.