Accepted Manuscript
Performance measurement of data flow processing employing software defined architecture
Lei Liu, Yongjian Ren, Lizhen Cui, Yuliang Shi, Qingzhong Li
PII: S0167-739X(17)31718-1
DOI: https://doi.org/10.1016/j.future.2017.12.050
Reference: FUTURE 3888
To appear in:
Future Generation Computer Systems
Received date: 31 July 2017
Revised date: 25 December 2017
Accepted date: 26 December 2017
Please cite this article as: L. Liu, Y. Ren, L. Cui, Y. Shi, Q. Li, Performance measurement of data flow processing employing software defined architecture, Future Generation Computer Systems (2018), https://doi.org/10.1016/j.future.2017.12.050
Performance Measurement of Data Flow Processing Employing Software Defined Architecture
Lei Liu^a, Yongjian Ren^a, Lizhen Cui^a, Yuliang Shi^a, Qingzhong Li^a
^a No.1500 Shunhua Road, Jinan, China
Abstract
With the development of information technology, the importance of big data has quickly become apparent. Big data applications offer great value to individuals, companies and governments. Recently, research on the storage and utilization of big data has achieved considerable results. The prosperity of big data applications draws attention to system performance, including timeliness and the use of computational and communication resources. Data retransmission caused by the violation of a stringent delay bound may result in the reprocessing of these data, which has a negative effect on user experience. To address this problem, a software defined architecture is developed in this work so that an appropriate starting point of processing can be found for the data that need to be reprocessed. To further improve processing performance, two models of this software defined architecture are presented. In the optimized model, a priority queue is employed to improve processing efficiency. In addition, data flows transmitted through networks exhibit obvious self-similar characteristics, and performance analysis that does not take traffic self-similarity into account may lead to unexpected results. In the optimized model, the tightly coupled system makes performance analysis difficult. Therefore, a decomposition approach is employed to divide the coupled system into a group of single-server single-queue systems. Finally, the developed model is validated through extensive experimental results.
Keywords: software defined architecture; performance measurement; data flow processing; queueing system; self-similar.
1. Introduction
Big data [1, 2, 3, 4] creates value for business and society, but poses challenges in terms of network management and analytics [5]. Big data is characterized not only by Volume, but also by Variety, Velocity, Value, and Veracity. Big data is not just data; it also implies new ideas and methods.
The value of big data can only be revealed after effective mining and analysis. Big data has high predictive power. In traffic management, business transactions, weather forecasting, national security, mobile services, manufacturing and other areas, big data plays a significant role in improving efficiency, saving resources and optimizing decision-making [6, 7, 8, 9, 10, 11]. Data flow processing in big data applications usually includes acquisition, preprocessing, storage, analysis and mining. The effective utilization of big data may reduce the operating costs of enterprises and society and improve efficiency. In the big data era, data flow processing is a challenging task because of the variety and velocity of big data [12, 13, 14, 15]. A data flow processing system should be fault-tolerant, flexible, and extensible [16, 17]. A large number of practical applications place higher requirements on big data analysis in terms of timeliness and computational and communication resources [18, 19, 20, 21, 22]. For example, according to one estimate, there were 9 billion interconnected things in China in 2014, and the number is expected to reach 24 billion by 2020 [23]. The rapid increase in devices could exhaust current communication resources, while the data created by these devices could put great pressure on computational resources. In fields such as weather forecasting and traffic management, people not only want to make better decisions by taking advantage of big data, but also have strict demands on response time. We note that the analysis of big data tends to be divided into phases. When a stringent delay bound is violated, data may be retransmitted and reprocessed [24]. If fault-tolerance mechanisms are not optimized, some processing stages may be duplicated unnecessarily [25]. Therefore, in this paper, a software defined architecture that employs a controller is developed to optimize the performance of data reprocessing.
Preprint submitted to Elsevier, December 25, 2017
When data retransmission occurs, the controller should give a proper starting point to avoid redundant processing stages. For example, when handling business at a bank
counter, if we enter the wrong password, we naturally want the opportunity to re-enter the password instead of returning to the back of the queue. We model this software defined architecture as a queueing system to study its processing performance. The types of data flow are diverse, which can have a significant impact on performance analysis. In a business context, the data flow may come from a database, so it is relatively stable. In a transportation system or a weather forecasting system, by contrast, data flows are mostly generated by sub-network systems consisting of various sensors, and are then converged to provide decision support. In this context, the data flow may show an obvious self-similar characteristic, which has been widely observed [26, 27]. Statistically, the parts of a self-similar process present the same statistical properties at different scales. Performance analysis that does not take self-similarity into account may lead to unexpected results. Therefore, a data flow with the self-similar characteristic is employed in our analytical model to evaluate the performance of the software defined architecture. Further, we optimize the software defined architecture. The data that need to be reprocessed raise a new problem: how do these data converge into the normal data flow, and what is their priority compared to normal data? In this paper, we consider two choices. In the first strategy, the data to be reprocessed have the same priority as the normal data, so both are treated equally. We model this processing scheme as the Single Queue Model (SQM). In the second strategy, the data to be reprocessed have a higher priority than the normal data flow. Since these data have already been processed once, they arrived at the system earlier than the normal data that are now waiting. We model this processing scheme as the Prioritized Service Model (PSM).
In order to prevent the data that need to be reprocessed from staying in the system too long, the PSM is selected as the recommended model of the developed software defined architecture. The performance comparison between the two processing models is given via our experiments to verify that our goals are achieved. As mentioned above, the PSM is employed to analyze the performance of the software defined architecture. In a data node with a priority queue, the two tightly coupled queues make the analysis difficult. Thus, the well-known Empty Buffer Approximation (EBA) theory [28] is applied in our work. Using the EBA theory and a decomposition method, the tightly coupled system can be statistically divided into two single-queue single-server systems. Then, we can carry out performance analysis using evaluation indexes such as queue length and queueing delay. We conducted a large number of simulation experiments to verify the analysis results. The experimental results show that our analytical model is both suitable and accurate. Throughout this paper, we make the following contributions.
• In order to improve the efficiency of data flow processing in big data applications, we developed a novel software defined architecture. A software defined controller is employed for the processing of data flows. Data need to be reprocessed when they are retransmitted due to the violation of stringent delay bounds. The controller gives a token to the data that need to be reprocessed to avoid redundant processing stages.
• Based on the developed software defined architecture, we give two performance analysis models, namely SQM and PSM. For the data to be reprocessed, the PSM with priority queues better meets the demand for timeliness. Therefore, in this paper, the PSM is used to analyze the performance of the software defined architecture by taking advantage of the EBA theory and a decomposition method.
• A series of experiments is conducted to compare the performance of SQM and PSM. The comparison results demonstrate the superiority of the PSM. We then conducted a large number of simulation experiments to validate our analytical results. The experimental results match the analytical results very well.
The remainder of this paper is organized as follows. Section 2 describes the preliminaries and related work of this study. First, the software defined architecture is introduced. Then, self-similarity and the analytical models are described. In Section 3, the concepts of the “queueing” and “transmission” behaviors are defined to simplify the performance analysis of the software defined architecture. Then, a performance comparison between SQM and PSM is presented. After that, the performance of the software defined architecture is characterized using the PSM. Section 4 compares the experimental results of the SQM and the PSM to show the advantage of the PSM. Then, the accuracy and feasibility of the PSM are validated by comparing analytical and experimental results. Finally, the paper is concluded in Section 5.
Figure 1: Data reprocessing with a software defined controller (diagram labels: Controller; Stage 1; Stage 2; Stage 3; Data flow; Data analysis and mining)
2. Preliminaries and Related Work
2.1. Software defined architecture
The value of big data lies in the results of effective analysis and mining of massive data. For real-time big data applications, such as weather forecasting and traffic management, the data source often includes multiple sub-network systems composed of various kinds of sensors. Data are gathered from these subsystems and then processed. The violation of stringent delay bounds may cause data retransmission, which leads to repeated data processing stages. In such a system, the lack of an optimization mechanism for fault tolerance may result in redundant processing. The processing of data flows usually goes through several sub-phases, as shown in Fig. 1. When data need to be reprocessed, we expect the processing to start from an appropriate stage, rather than blindly starting from scratch. Therefore, a software defined architecture is employed, in which a software defined controller selects the appropriate stage at which data reprocessing begins, as shown in Fig. 1. If data need to be reprocessed, the controller collects the necessary information and gives a token to these data. When the data node reprocesses these data, it can find the appropriate starting point according to this token. Thus, the efficiency of the whole system can be improved to meet the response time demands of real-time big data applications.
2.2. Self-similar arrival
Observations of time series may show an obvious scaling characteristic, which is widely recognized as “self-similarity”. Credible surveys of network traffic statistics show that data flows in networks often have obvious self-similar characteristics. Therefore, a large number of traffic modeling and analysis methods have been proposed to describe this feature. An intuitive description of self-similarity is that the data show the same statistical characteristics at different scales.
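The scaling behavior described above can be made concrete with a small numerical sketch. It assumes fractional Gaussian noise (the increment process of fractional Brownian motion) with unit variance and a scaling exponent H, the Hurst parameter discussed in the remainder of this section; the function names are hypothetical and the code is an illustration, not part of the original study. For such a process the variance of a block average over n samples decays as n^(2H-2), so for H close to 1 averaging over longer time scales smooths the series much more slowly than for independent samples (H = 0.5).

```python
def fgn_autocovariance(lag: int, hurst: float) -> float:
    """Autocovariance at the given lag of unit-variance fractional Gaussian noise."""
    k = abs(lag)
    return 0.5 * ((k + 1) ** (2 * hurst) - 2 * k ** (2 * hurst) + abs(k - 1) ** (2 * hurst))


def block_mean_variance(block: int, hurst: float) -> float:
    """Variance of the average of `block` consecutive samples (double sum of covariances)."""
    total = sum(
        fgn_autocovariance(i - j, hurst)
        for i in range(block)
        for j in range(block)
    )
    return total / block ** 2


if __name__ == "__main__":
    for hurst in (0.5, 0.8):
        for block in (1, 10, 100):
            # The last two columns agree: direct covariance sum vs. the power law n^(2H-2).
            print(hurst, block, block_mean_variance(block, hurst), block ** (2 * hurst - 2))
```

With H = 0.5 the variance of the block average falls as 1/n, whereas with H = 0.8 it falls only as n^(-0.4), which is the slow smoothing that the day-by-day example in the following paragraph describes.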
For example, the self-similarity of network flows [26, 29] is an important and familiar phenomenon in modern communication networks. It is understandable that the quantity of data (e.g. the number of packets) varies from minute to minute within a day. The characteristic of self-similarity is that, when we observe the quantity of data for each day in a year, the series still varies rather than becoming smooth. More details about the self-similar characteristic are described in Leland’s work [27]. The Hurst parameter $H$ [27] is an important parameter that represents the degree of self-similarity. If the value of the Hurst parameter is 0.5, the sequence has no self-similar feature. The closer $H$ is to 1, the higher the degree of self-similarity. Several effective ways of measuring self-similarity have been developed, such as the well-known R/S estimator and the VT estimator [30]. Studies on simulating self-similar flows have also achieved many significant results, and such simulation is usually a basic tool for further research on self-similar flows in different application contexts. The fractional Brownian motion (fBm) process [31] has proved to be a useful model for generating self-similar flows [28, 32]. In this study, the fBm process is used to model the cumulative self-similar arrival input of the system.
2.3. Analytical model
As Fig. 2(a) shows, an SQM based on a queueing system is established. This model explicitly represents the controller and the data node, which are connected through a feedback path. The mean service time of the controller is denoted by $T_c$, and the mean service time of the data node is denoted by $T_n$. Service times are usually assumed to be exponentially distributed. The capacities of the controller and the data node are then described by the mean service rates $C_c = E(1/T_c)$ and $C_n = E(1/T_n)$, respectively. The parameter $m$ is the mean arrival rate of the input data flow; in steady state, the output flow has the same mean departure rate. If the data in a data node are reprocessed with probability $P$, then the traffic arriving at the controller has a mean arrival rate of $mP$. However, the model presented in Fig. 2(a) is not very efficient, because the data coming from the controller have the same priority as newly arriving data. Data that need to be reprocessed have lingered in the data node longer than the normal data, so they should be processed as soon as possible. Therefore, we use a high priority queue to handle the data that need to be reprocessed, as shown in Fig. 2(b).
Figure 2: Analytical models. (a) Single Queue Model (SQM); (b) Prioritized Service Model (PSM). Diagram labels: controller ($C_c$); data node ($C_n$); flows $m$ and $mP$; high priority queue; low priority queue.
Self-similarity has been shown to be a significant phenomenon in networks, but it has rarely been studied in data flow processing. The pattern of the input data flow has an important effect on the analytical model of data flow processing performance, and performance analysis that does not take self-similarity into account may lead to unexpected results. To fill this gap, our PSM builds on existing studies of self-similar traffic, and a comprehensive analysis of data node performance is presented. Merging and splitting of traffic flows are the most common phenomena in network systems. If two self-similar network traffic flows have the same value of the Hurst parameter, then the converged flow is still self-similar and has the same value of the Hurst parameter [33]. Conversely, if a self-similar flow is divided into multiple sub-flows, then these sub-flows have the same Hurst parameter as the parent flow. These properties are of great help in our analysis. The Large Deviation Principle (LDP) [34] is a significant analytical method for the performance measurement of a system with self-similar input data flow.
Using this method, the probability distribution of the queue length of self-similar flows can be derived [35, 36]. For a set of random variables, this method gives asymptotic upper and lower bounds of the probability distribution. Relevant studies have shown that the geometric mean of the asymptotic upper and lower bounds is a good estimator [37]. Consider a single-queue single-server system with exponentially distributed service times. During the time period $(a, b)$, the accumulated queue length of an aggregated traffic flow can be expressed as:
$$Q(a, b) = \sup_{a \le b} \{A_r(a, b) - C_n (b - a)\} \tag{1}$$
where $A_r(a, b)$ is the accumulated amount of traffic in the time period $(a, b)$ and $C_n$ is the processing rate, which represents the server capacity. Further, we can obtain the asymptotic lower and upper bounds of the queue length distribution as follows [37]:
$$\Phi\left(\sqrt{L(\hat{t})}\right) \le P(Q > x) \le e^{-L(\hat{t})/2} \tag{2}$$
$\Phi(\cdot)$ is the residual distribution function of the standard Gaussian distribution, which is usually approximated as:
$$\Phi(x) \approx \frac{1}{\sqrt{2\pi}\,(1 + x)}\, e^{-\frac{1}{2}x^2} \tag{3}$$
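As a quick sanity check on approximation (3), the sketch below compares it with the exact Gaussian tail computed from the complementary error function in the Python standard library. The function names are hypothetical, and the comparison is only an illustration of how close the approximation is for a few arguments, not a result from the original study.

```python
import math


def gaussian_tail_exact(x: float) -> float:
    """Exact residual distribution P(Z > x) of a standard Gaussian variable Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def gaussian_tail_approx(x: float) -> float:
    """Approximation (3): exp(-x^2 / 2) / (sqrt(2*pi) * (1 + x)), for x >= 0."""
    return math.exp(-0.5 * x * x) / (math.sqrt(2.0 * math.pi) * (1.0 + x))


if __name__ == "__main__":
    for x in (1.0, 2.0, 3.0, 5.0):
        exact = gaussian_tail_exact(x)
        approx = gaussian_tail_approx(x)
        print(x, exact, approx, approx / exact)
```

For the arguments tried here the approximation stays within the same order of magnitude as the exact tail and slightly below it, which is why it is a convenient closed form for the lower bound in (2).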
Figure 3: The data flow processing scheme in PSM (diagram labels: Queueing: waiting in data node buffer, processed by data node; Transmission: in controller, waiting in data node, processed by data node)
Therefore, the upper and lower bounds of the queue length distribution differ only in the coefficient of $e^{-L(\hat{t})/2}$. As the geometric mean has been shown to be an effective estimate, we can obtain the approximate queue length distribution [37]:
$$P(Q > x) \approx \frac{e^{-L(\hat{t})/2}}{\sqrt[4]{2\pi \left(1 + \sqrt{L(\hat{t})}\right)^{2}}} \tag{4}$$
Here $\hat{t}$ is the time instant that minimizes the function $L(t)$ [37]:
$$L(t) = \frac{(-x + (C_n - m)\,t)^2}{\mathit{var} \cdot m\, t^{2H}} \tag{5}$$
where $\mathit{var}$ is the variance coefficient. The EBA theory is a well-known method for analyzing the queue length of the low priority buffer in a priority queue system. When the server has sufficient capacity, the queue length of the high priority queue can be ignored, since it is almost an empty buffer. In such a priority system, the high priority data flow can be treated as if it were added to the low priority queue. The data flow in the high priority queue gets timely service, while the data flow in the low priority queue has to wait longer. The EBA theory implies that the total queue length of the whole system consists mostly of the length of the low priority queue. Therefore, the performance of the low priority queue can be approximated using the converged data flow. To do this, we apply a decomposition method so that the tightly coupled system becomes statistically equivalent to two single-queue systems. The validity of the EBA theory has been supported empirically.
3. Performance Evaluation
In this section, we take advantage of the LDP and the EBA theory, and a decomposition method is used to support the analysis of the proposed PSM. Definitions of the “queueing” and “transmission” behaviors are given in order to simplify the analysis of the software defined architecture. After that, the “queueing” performance and the “transmission” improvement are studied.
3.1. Definitions of “queueing” and “transmission”
Data processed in this software defined architecture may have different experiences, as presented in Fig. 3. When arriving at the data node, newly arrived data must wait for a period of time in the buffer. Then, if the data are successfully processed, they leave the data node. We define this waiting behavior as “Queueing”. If, instead, the data are not successfully processed in the data node, the necessary information is sent to the controller.
The controller assigns a token to the data after analyzing the information, and these data then wait for another processing chance. To facilitate performance analysis, we define this process as “Transmission”, as Fig. 3 shows.
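The token mechanism can be sketched in a few lines. The paper does not specify what a token contains, so the sketch below assumes, purely for illustration, that the token stores the index of the stage at which reprocessing should resume and that the output of the last successful stage is still available; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Token:
    """Hypothetical token issued by the controller: the stage index to resume from."""
    resume_stage: int


def run_pipeline(
    data: int,
    stages: List[Callable[[int], int]],
    token: Optional[Token] = None,
) -> Tuple[int, List[int]]:
    """Run the stages in order, starting at the token's stage if a token is given."""
    start = token.resume_stage if token is not None else 0
    executed = []
    for index in range(start, len(stages)):
        data = stages[index](data)
        executed.append(index)  # record which stages actually ran
    return data, executed


if __name__ == "__main__":
    # Three toy stages standing in for Stage 1, Stage 2 and Stage 3 of Fig. 1.
    stages = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
    print(run_pipeline(5, stages))                          # full run from scratch
    print(run_pipeline(12, stages, Token(resume_stage=2)))  # resume at the last stage
```

In this toy run the reprocessed data skip the first two stages and reach the same result as the full run, which is the saving that the controller is meant to provide.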
3.2. Comparison between PSM and SQM
3.2.1. Improvement on the transmission delay
The data that need to be reprocessed have spent more time in the system than normal data, because they have already been processed once. One cause of data loss is the violation of the stringent delay bound: if data spend more time in the system than the delay bound allows, data retransmission may occur. Unlike SQM, PSM uses a priority queue to reduce the transmission time. Thus, data loss may decrease and better performance can be expected. In this part, we focus on the difference in transmission delay ($D_{tr}$) between SQM and PSM. As mentioned, a “transmission” contains several parts. The transmission delay $D_{tr}(M)$ consists of the communication time ($T_{com}$) between the controller and the data node, the delay in the controller buffer ($D_c$), the response time ($T_c$) spent by the controller, and another queueing delay ($D'_q$) in the data node buffer. Therefore, we define the transmission delay as:
$$D_{tr}(M) := T_{com} + D_c + T_c + D'_q(M) \tag{6}$$
where the parameter $M$ represents the type of analytical model (i.e. SQM or PSM). As the high priority queue is an almost empty buffer most of the time, we have $D'_q(PSM) < D'_q(SQM)$, and therefore $D_{tr}(PSM) < D_{tr}(SQM)$. In SQM, the data from the controller are treated equally with the external data, so $D'_q(SQM) = D_q$. Taking $D'_q(PSM) \approx 0$ for the almost empty high priority queue, we define the improvement rate on transmission delay as $\eta$:
$$\eta := E\left(\frac{D_{tr}(SQM) - D_{tr}(PSM)}{D_{tr}(SQM)}\right) \approx E\left(\frac{D_q}{T_{com} + D_c + T_c + D_q}\right) \tag{7}$$
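The inequality $D'_q(PSM) < D'_q(SQM)$ can be illustrated with a toy slotted simulation. This is a minimal sketch under strong simplifying assumptions that are not taken from the paper: all items arrive in one burst, the data node serves exactly one item per tick, and the whole controller round trip takes a fixed number of ticks. The function name and parameters are hypothetical.

```python
import heapq
from itertools import count


def completion_ticks(model, n_items, fail_first, controller_delay):
    """Return {item: tick at which the item finally leaves the data node}.

    model is "SQM" (returning data join the single queue) or "PSM"
    (returning data enter a high priority queue, modeled as priority 0).
    """
    order = count()   # arrival order, used as a FIFO tie-breaker
    queue = []        # heap of (priority, arrival order, item, is_retry)
    for item in range(n_items):
        heapq.heappush(queue, (1, next(order), item, False))
    in_transmission = []  # (tick at which the item is back at the node, item)
    done = {}
    tick = 0
    while queue or in_transmission:
        # Items whose controller round trip has ended rejoin the data node.
        for entry in [e for e in in_transmission if e[0] <= tick]:
            in_transmission.remove(entry)
            priority = 0 if model == "PSM" else 1
            heapq.heappush(queue, (priority, next(order), entry[1], True))
        if queue:
            _, _, item, is_retry = heapq.heappop(queue)
            if item in fail_first and not is_retry:
                # First pass failed: the item goes through "transmission".
                in_transmission.append((tick + 1 + controller_delay, item))
            else:
                done[item] = tick + 1
        tick += 1
    return done


if __name__ == "__main__":
    for model in ("SQM", "PSM"):
        print(model, completion_ticks(model, n_items=6, fail_first={0}, controller_delay=1))
```

In this toy setting the reprocessed item finishes much earlier under PSM than under SQM, at the cost of delaying each waiting normal item by one service slot, which mirrors the trade-off behind the priority queue.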
3.2.2. Response time of reprocessed data
When the mean arrival rate and the probability of data reprocessing are fixed, the distribution of the queueing delay ($D_q$) is also fixed. Hence, the value of $\eta$ is determined by the value of $D_{tr}(SQM)$, and especially by the value of $T_c$. Since a more powerful controller leads to a smaller $D_c$ and a smaller $T_c$, formula (7) indicates that a more powerful controller yields a better improvement rate $\eta$. The response time of reprocessed data in a data node is much greater than that of normal data. Thus, these data are more sensitive to the stringent delay bound. As Fig. 3 shows, if data do not need to be reprocessed, the response time is:
$$RT_{normal}(M) = D_q + T_n \tag{8}$$
where $T_n$ represents the mean service time of the data node, with an average value of $1/C_n$. If data need to be reprocessed, the response time is denoted by $RT_{reprocessing}(M)$:
$$RT_{reprocessing}(M) = D_q + T_n + D_{tr}(M) + T'_n \tag{9}$$
where $T'_n$ represents the second service time spent by the data node, with $E(T_n) = E(T'_n)$. Since $D_{tr}(PSM) < D_{tr}(SQM)$, we can easily obtain:
$$RT_{reprocessing}(PSM) < RT_{reprocessing}(SQM) \tag{10}$$
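Formulas (6) to (10) can be checked with simple arithmetic. The sketch below uses point values instead of the expectations in (7), sets $D'_q(PSM)$ to zero as the empty buffer approximation suggests, and plugs in delay values that are purely hypothetical (they are not measurements from the paper); the function names are also introduced here for illustration.

```python
def transmission_delay(t_com, d_c, t_c, d_q_prime):
    """Eq. (6): communication time + controller queueing + controller service + re-queueing."""
    return t_com + d_c + t_c + d_q_prime


def rt_normal(d_q, t_n):
    """Eq. (8): response time of data that are processed successfully the first time."""
    return d_q + t_n


def rt_reprocessing(d_q, t_n, d_tr):
    """Eq. (9): first pass, transmission, and a second service time of the same mean."""
    return d_q + t_n + d_tr + t_n


def improvement_rate(d_tr_sqm, d_tr_psm):
    """Eq. (7) evaluated for point values rather than expectations."""
    return (d_tr_sqm - d_tr_psm) / d_tr_sqm


if __name__ == "__main__":
    # Hypothetical delays in unit times: T_com = 2, D_c = 1, T_c = 1, D_q = 16, T_n = 0.1.
    d_tr_sqm = transmission_delay(2.0, 1.0, 1.0, d_q_prime=16.0)  # D'_q(SQM) = D_q
    d_tr_psm = transmission_delay(2.0, 1.0, 1.0, d_q_prime=0.0)   # D'_q(PSM) ~ 0
    print("eta =", improvement_rate(d_tr_sqm, d_tr_psm))
    print("RT normal =", rt_normal(16.0, 0.1))
    print("RT reprocessing (SQM) =", rt_reprocessing(16.0, 0.1, d_tr_sqm))
    print("RT reprocessing (PSM) =", rt_reprocessing(16.0, 0.1, d_tr_psm))
```

With these illustrative numbers the queueing delay dominates the transmission delay in SQM, so most of it is saved by the priority queue; shrinking $D_c$ and $T_c$ further would push the improvement rate closer to 1.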
3.3. Performance analysis in PSM
3.3.1. Queueing length in PSM
Queue length is one of the most important indexes in network performance analysis. On the one hand, memory is a valuable resource for any network device. When the queue length exceeds a preset threshold, some data must be discarded according to certain policies. On the other hand, queue length has a direct impact on the response time, and thus may harm the user experience. Therefore, the length of the queue, and especially its distribution, is the first indicator that we study. As mentioned earlier, splitting and merging operations in PSM do not change the nature of the self-similar flow. Assume that the data reprocessing probability ($P$) is stable when the system runs stably. Then the aggregated flow to the data node server is a self-similar flow with arrival rate $m(1 + P)$. Because the high priority queue is almost an empty buffer in the priority queue system, we can approximate the queue length of the low priority queue using (4), where $L(t)$ is expressed as:
$$L(t) = \frac{(-x + (C_n - (1 + P)m)\,t)^2}{\mathit{var}\,(1 + P)m\, t^{2H}} \tag{11}$$
Then, by minimizing $L(t)$, we can derive that:
$$\hat{t} = \frac{Hx}{(C_n - m - mP)(H - 1)} \tag{12}$$
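The approximation built from (4), (11) and (12) is easy to evaluate numerically. The sketch below is one possible reading of those formulas, not the authors' code: since the minimizer in (12) is negative for $H < 1$, the code works with $|t|$ and interprets $t^{2H}$ as $|t|^{2H}$, which is an assumption made here. The function and parameter names are hypothetical.

```python
import math


def ldp_rate(t_abs, x, capacity, m, p, hurst, var_coeff=1.0):
    """L(t) of Eq. (11), written in terms of |t| (so -x + c*t becomes x + c*|t|)."""
    load = (1.0 + p) * m  # aggregated arrival rate m(1 + P)
    return (x + (capacity - load) * t_abs) ** 2 / (var_coeff * load * t_abs ** (2.0 * hurst))


def minimizing_time(x, capacity, m, p, hurst):
    """|t_hat| from Eq. (12); requires capacity > m(1 + P) and 0 < hurst < 1."""
    load = (1.0 + p) * m
    if capacity <= load or not 0.0 < hurst < 1.0:
        raise ValueError("need capacity > m(1 + P) and 0 < H < 1")
    return hurst * x / ((capacity - load) * (1.0 - hurst))


def queue_tail(x, capacity, m, p, hurst, var_coeff=1.0):
    """Approximate P(Q > x) from Eq. (4), the geometric mean of the bounds in (2)."""
    t_hat = minimizing_time(x, capacity, m, p, hurst)
    rate = ldp_rate(t_hat, x, capacity, m, p, hurst, var_coeff)
    return math.exp(-rate / 2.0) / ((2.0 * math.pi) ** 0.25 * math.sqrt(1.0 + math.sqrt(rate)))


if __name__ == "__main__":
    # Parameters of Case 1 and Case 2 as listed in Table 1 (C_n = 10, var = 1, H = 0.8).
    print("Case 1, P(Q > 1000):", queue_tail(1000.0, 10.0, 8.0, 0.01, 0.8))
    print("Case 2, P(Q > 3000):", queue_tail(3000.0, 10.0, 9.0, 0.01, 0.8))
```

Under this reading, the Case 1 value is on the order of $10^{-4}$ and the Case 2 value on the order of $10^{-2}$, the same orders of magnitude as the probabilities quoted for these cases in Section 4.2.1.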
3.3.2. Queueing delay in PSM
Queueing delay ($D_q$) is another important statistic that is closely related to system performance. Delay not only affects the response time, but also has a significant impact on the probability of data loss. Data usually have stringent delay bounds when they are transmitted in modern networks. If data spend too much time in the system, the probability of data loss increases. For this reason, a priority queue is employed to process the data that need reprocessing, and we use the PSM to study the queueing delay in the developed software defined architecture. If the queue length is $x$, the queueing delay can be obtained from $E(D_q) = x/E(C_n)$, especially when the queue is fairly long. Then we have:
$$P(D > d) \approx \frac{e^{-L(\hat{t})/2}}{\sqrt[4]{2\pi \left(1 + \sqrt{L(\hat{t})}\right)^{2}}} \tag{13}$$
where $\hat{t}$ is the value that minimizes the function:
$$L(t) = \frac{(-C_n d + (C_n - (1 + P)m)\,t)^2}{\mathit{var}\,(1 + P)m\, t^{2H}} \tag{14}$$
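As a worked step that the text leaves implicit, the minimizer of (14) can be obtained in the same way as (12). The following is a sketch under the assumptions that $C_n > (1 + P)m$, that $0 < H < 1$, and that $t^{2H}$ is read as $|t|^{2H}$ for the negative minimizer; the shorthand symbols $c$ and $\lambda$ are introduced here for brevity and are not part of the original notation.

```latex
% Shorthand (not in the original text): \lambda = (1+P)m, \qquad c = C_n - \lambda > 0
\begin{align*}
\frac{dL}{dt} = 0
  &\;\Longrightarrow\; c\,t = H\,(-C_n d + c\,t)
  \;\Longrightarrow\; \hat{t} = \frac{H\,C_n d}{c\,(H-1)}, \\
-C_n d + c\,\hat{t} &= -\frac{C_n d}{1-H}, \\
L(\hat{t}) &= \frac{(C_n d)^{2-2H}\, c^{2H}}{\mathit{var}\,\lambda\, H^{2H}\,(1-H)^{2-2H}} .
\end{align*}
```

The same $\hat{t}$ results from substituting $x = C_n d$ in (12), which is consistent with the relation $E(D_q) = x/E(C_n)$ used above.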
3.3.3. Data loss in “queueing”
Data loss, commonly caused by buffer overflow or by exceeding the stringent delay bound, is an inevitable problem in modern networks. One benefit of dropping data, however, is that it protects the network from congestion. When a buffer overflow is imminent, different drop strategies are used in communication networks. One common strategy is to drop the data at the front of the buffer rather than the newcomer. This strategy is designed to avoid further data loss caused by data exceeding their life cycle. Another circumstance that leads to data loss is the violation of the stringent delay bound: when data spend more time in the system than their delay bound allows, they must be dropped. In our work, we only consider the data loss caused by violation of the stringent delay bound, which means that we assume an infinite buffer in our simulation. A series of delay bounds is set for the same input data flow to analyze the probability of data loss. We can then predict the probability of data loss when the delay bound is specified.
4. Validation of the Analytical Models
In this section, extensive experiments are carried out to validate the analysis results presented in Section 3. First, the transmission delay performance of SQM and PSM is compared. All data experience the waiting behavior in the buffer, while the transmission behavior only occurs for reprocessed data. The transmission behavior incurs an extra time cost, so the timeliness of handling reprocessed data has an important influence on the overall user experience. Thus, a priority queue is employed in PSM to speed up the processing of reprocessed data. Second, comprehensive experiments are carried out to validate the accuracy of the analysis results of the PSM. Several parameters are adjusted to show the adaptability of PSM. Many factors influence the performance of PSM.
Among them, some factors have apparent effects on the self-similar flow itself, such as the Hurst parameter ($H$) and the variance coefficient ($\mathit{var}$) mentioned above. Other factors have an apparent influence on the data flow load, such as the mean arrival rate $m$ and the data reprocessing probability $P$. In this paper, 4 representative cases are selected for performance analysis by adjusting the Hurst parameter ($H$), the mean arrival rate $m$ and the data reprocessing probability $P$. The values of the other parameters are as follows. The value of $\mathit{var}$ is set to 1, and the mean data processing rate of the controller ($C_c$) is set to 1 per unit time. In our simulation, the unit time is set to 10 microseconds. More details of the parameter settings are presented in Table 1.
Table 1: Parameters in the 4 cases
        Case 1   Case 2   Case 3   Case 4
H       0.80     0.80     0.80     0.75
var     1        1        1        1
C_n     10       10       10       10
m       8.0      9.0      8.0      8.0
P       1%       1%       10%      10%
Figure 4: Performance comparison between PSM and SQM. (a) Transmission delay improvement rate (η) for cases 1 to 4; (b) Mean response time of reprocessed data for cases 1 to 4, SQM versus PSM.
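To see how the four cases of Table 1 differ in load, the short sketch below computes the aggregated arrival rate $m(1 + P)$ from Section 3.3.1 and divides it by the data node capacity $C_n$. The dictionary layout and the function name are hypothetical conveniences; the parameter values are those listed in Table 1.

```python
# Parameter values copied from Table 1; the dictionary layout is only for illustration.
CASES = {
    "Case 1": {"H": 0.80, "var": 1.0, "Cn": 10.0, "m": 8.0, "P": 0.01},
    "Case 2": {"H": 0.80, "var": 1.0, "Cn": 10.0, "m": 9.0, "P": 0.01},
    "Case 3": {"H": 0.80, "var": 1.0, "Cn": 10.0, "m": 8.0, "P": 0.10},
    "Case 4": {"H": 0.75, "var": 1.0, "Cn": 10.0, "m": 8.0, "P": 0.10},
}


def utilization(case):
    """Offered load of the data node: aggregated rate m(1 + P) over capacity C_n."""
    return (1.0 + case["P"]) * case["m"] / case["Cn"]


if __name__ == "__main__":
    for name, case in CASES.items():
        print(name, "H =", case["H"], "utilization =", round(utilization(case), 3))
```

Case 2 and Case 3 both load the data node more heavily than Case 1 (through a higher $m$ and a higher $P$, respectively), while Case 3 and Case 4 carry the same load and differ only in the Hurst parameter.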
4.1. Performance comparison between PSM and SQM
In the PSM, the priority queue is used to decrease the transmission delay and the response time of the reprocessed data. As expressed in (7), we use the parameter $\eta$ to analyze the influence of the priority queue on the transmission delay. For the same input data flow with a stable data reprocessing probability $P$, the mathematical expectation of the queueing delay $E(D_q)$ is fixed. Hence, the value of $\eta$ is determined by the values of $T_c$ and $D_c$, since $T_{com}$ (i.e. the communication time that a message spends between the controller and the data node) can also be treated as a constant. Fig. 4 compares the timeliness of reprocessed data under SQM and PSM. Fig. 4(a) shows the efficiency of the PSM through the improvement rate on transmission delay, $\eta$. Fig. 4(b) shows the superiority of the PSM in handling the data that need to be reprocessed in terms of the mean response time. Because the response time of reprocessed data is significantly longer than that of normal data, reducing the average response time of reprocessed data is of great significance. Fig. 4(b) clearly shows that the data that need to be reprocessed always have a shorter mean response time in PSM than in SQM.
4.2. Performance validation of PSM
4.2.1. Queueing length in PSM
Experiments for all 4 cases in Table 1 are conducted to study the queueing length performance of the data node. In Case 1 and Case 2, the mean arrival rates ($m$) are set to 8.0 and 9.0 data per unit time, respectively. As Fig. 5(a) shows, when the mean arrival rate increases from 8.0 to 9.0, the queueing length rises sharply. When the mean arrival rate is 8.0, the probability that the queue length exceeds 1000 is about 0.01% (i.e. $P(Q > 1000) = 0.01\%$). When the mean arrival rate is 9.0, the probability that the queue length exceeds 3000 is about 1% (i.e. $P(Q > 3000) = 1\%$).
Case 3 has a higher data reprocessing probability than Case 1, which results in a higher gathering load of data flow in Case 3 than in Case 1, and leads to an increase of the queue length. By comparing Case 8
10 0
10 0 case 1 ana case 1 sim case 2 ana case 2 sim
10 -2
P(Q>x)
P(Q>x)
10 -2
10 -4
10 -6
10 -8
case 3 ana case 3 sim case 4 ana case 4 sim
10 -4
10 -6
0
1000
2000 3000 Queueing length
4000
10 -8
5000
(a) Queueing length of Case 1 and Case 2
0
500
1000 1500 2000 Queueing length
2500
3000
(b) Queueing length of Case 3 and Case 4
Figure 5: Queue length of the low priority queue in PSM
3 and Case 4, it is clear that the difference between data flows has an obvious effect on system performance. A small value of the Hurst parameter indicates a lower degree of self-similarity of the data flow. The experimental results suggest that the higher the self-similarity, the higher the probability of congestion in the buffer. In addition, the simulation results match the analytical results well in Case 3 and Case 4. This shows that even when the reprocessing probability is considerable, our analytical model still works effectively.

4.2.2. Queueing delay in PSM

The queue length distribution has a significant influence on the queueing delay. The queueing delay distributions of Case 1 and Case 2 are presented in Fig. 6(a). In Case 2, the probability that the queueing delay exceeds 300 is about 1% (i.e., P(D > 300) = 1%). Comparing Fig. 6(a) and Fig. 5(a), we can see that the trends of the queue length and the queueing delay are quite similar. Because we use the formula E(D_q) = x/E(C_n), the scale of the x-axis in Fig. 6(a) is only one-tenth of that in Fig. 5(a). This indicates that the service rate of the data node is stable enough. Comparing the delays of Case 3 and Case 1 shows that increasing the data reprocessing probability significantly increases the delay in a data node. The comparison between Case 3 and Case 4 shows that a data flow with weak self-similarity is likely to have low data latency.

4.2.3. Data loss caused by “queueing”

Fig. 7 shows the distribution of data loss caused by “queueing” behavior for the 4 cases. Data loss occurs when data spend more time in the system than their stringent delay bound allows. Thus, the probability of queueing delay has a direct impact on the probability of data loss. In our work, we set a series of delay bounds for each of the 4 cases to characterize the behavior of the data loss. In Fig. 7, the values on the x-axis are different stringent delay bounds. If the response time of the data is greater than this delay bound, data loss occurs. For each of the 4 cases, a series of simulations with different delay bounds was carried out. The probabilities of data loss under the different delay bounds were then obtained for each case. Comparing Fig. 6 and Fig. 7, it is apparent that the distribution of queueing loss follows the same trend as that of queueing delay. However, the queueing loss probability is much smaller than the queueing delay probability for the same delay value. For example, considering the probabilities of queueing delay and data loss in Case 2, P(D > 300) = 1% while the corresponding probability of data loss is about 0.02%, i.e., P_loss(D_bound = 300) = 0.02%. Fig. 5, Fig. 6 and Fig. 7 give a comprehensive view of the performance of the PSM. They demonstrate the accuracy of the decomposition approach using EBA theory under a variety of self-similar flows. Distribution functions of metrics including queue length, delay, and data loss give a comprehensive presentation of the performance
[Figure 6 panels: (a) Queueing delay of Case 1 and Case 2; (b) Queueing delay of Case 3 and Case 4. Vertical axis: P(D>d), logarithmic scale. Curves: analytical (ana) and simulation (sim) results for each case.]
Figure 6: Queueing delay of the low priority queue in PSM
[Figure 7 panels: (a) Data loss of Case 1 and Case 2; (b) Data loss of Case 3 and Case 4. Vertical axis: probability of data loss, P_loss(D_bound = d), logarithmic scale. Curves: simulation (sim) results for each case.]
Figure 7: Data loss in the low priority queue of PSM
of the data node under the developed software defined architecture. The comparison between the extensive simulation results and the analysis shows that our analytical model has satisfactory accuracy and applicability.

5. Conclusion

With the advance of modern information technology, the analysis and utilization of big data have made impressive progress. With the growth of big data applications, people have shown great interest in the timeliness of big data analysis. Data loss caused by the violation of the stringent delay bound may lead to data retransmission, which may in turn lead to redundant processing stages for these data. An appropriate start point for the data reprocessing task can improve the efficiency and response time of applications. Therefore, a software defined architecture is developed to decrease the response time of big data applications. In this paper, we have developed and compared two analytical models that represent two implementations respectively. The PSM, which employs priority queues, has been analyzed and validated to have considerable advantages. In this model, retransmitted data have a higher service priority in the data node so that they can be processed as soon as possible. In the PSM, the tight coupling of the two queues makes the analysis difficult. By using the classical EBA theory and a decomposition method, the two tightly coupled queues can be treated as statistically equivalent to two single-queue models. We then applied LDP theory to analyze the performance of the PSM. We validated our analytical results by simulation experiments on a series of metrics. The experimental results show that our analytical model is both general and accurate. Finally, we compared the performance of the SQM and the PSM through simulation experiments. The experimental results show that the PSM has a distinct advantage over the SQM in optimizing the reprocessing of data.

Acknowledgment

This work is partially supported by NSFC (No. 61402262), the National Key Research and Development Plan (No. 2016YFB1000602), the Natural Science Foundation of Shandong Province for Major Basic Research Projects (No. ZR2017ZB0419), the Taishan Industrial Experts Programme of Shandong Province (No. tscy20150305), and the Key Research and Development Program of Shandong Province (No. 2016GGX101008, 2016ZDJS01A09).
Biographies of all authors:

Lei Liu is an associate professor at the School of Computer Science and Technology, Shandong University, Jinan, China. His research interests include multimedia networks and security, cloud computing, and network performance.

Yongjian Ren is pursuing his Ph.D. degree in the Department of Computer Science and Technology, Shandong University. He currently focuses on network performance engineering, big data processing, and cloud computing.

Lizhen Cui is a professor at the School of Computer Science and Technology, Shandong University, Jinan, China. His research interests include service computing, cloud computing, and cyber-physical systems.

Yuliang Shi is an associate professor at the School of Computer Science and Technology, Shandong University, Jinan, China. His research interests include service computing, cloud computing, and databases.

Qingzhong Li is a professor at the School of Computer Science and Technology, Shandong University, Jinan, China. His research interests include data science and intelligent data analysis, data computing, and cloud computing software architecture.
Highlights for review:

1. A software defined data processing architecture is developed. When data need to be reprocessed, the controller gives the data a token to improve the data processing efficiency of the system.

2. An analytical performance model that characterizes the data processing structure with the software defined controller is developed. A high-priority queue is employed to handle data that need to be reprocessed, reducing the delay of these data.

3. A decomposition approach is developed, which divides the complex system into a number of statistically equivalent single queueing systems to evaluate the data processing performance of the software defined data processing architecture.
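
To make the priority mechanism in the second highlight concrete, the following is a minimal simulation sketch and not the model analyzed in this paper. It assumes a discrete-time data node with a single server and two FIFO queues, where data flagged for reprocessing enter the high-priority queue and are always served first. The function names, the parameter values, and the Bernoulli arrival and service processes are all illustrative assumptions; in particular, Bernoulli arrivals are not self-similar, so the sketch does not reproduce the traffic studied above. It only shows how an empirical tail probability P(D > d) can be measured separately for each queue.

```python
import random
from collections import deque


def simulate_priority_node(num_slots, arrival_prob, reprocess_prob,
                           service_prob, seed=0):
    """Discrete-time sketch of one server with two FIFO queues.

    All parameters are hypothetical, not values from the paper.
    Returns (high_delays, low_delays): the queueing delay, in slots,
    of every item that was served from each queue.
    """
    rng = random.Random(seed)
    high, low = deque(), deque()  # each entry stores the arrival slot
    high_delays, low_delays = [], []
    for t in range(num_slots):
        # At most one arrival per slot (assumption); a fraction of the
        # arrivals is flagged for reprocessing and gets high priority.
        if rng.random() < arrival_prob:
            target = high if rng.random() < reprocess_prob else low
            target.append(t)
        # The server completes one item with probability service_prob and
        # takes it from the high-priority queue whenever that is non-empty.
        if rng.random() < service_prob:
            if high:
                high_delays.append(t - high.popleft())
            elif low:
                low_delays.append(t - low.popleft())
    return high_delays, low_delays


def tail_probability(delays, d):
    """Empirical P(D > d); returns 0.0 for an empty sample."""
    if not delays:
        return 0.0
    return sum(1 for x in delays if x > d) / len(delays)


if __name__ == "__main__":
    hi_d, lo_d = simulate_priority_node(
        num_slots=20000, arrival_prob=0.7, reprocess_prob=0.2,
        service_prob=0.8, seed=1)
    for d in (0, 5, 20):
        print(d, tail_probability(hi_d, d), tail_probability(lo_d, d))
```

Under these assumptions the high-priority queue sees only the reprocessed data, so its delays stay short, while the low-priority queue absorbs the remaining load. Replacing the Bernoulli arrivals with a self-similar traffic generator would be needed before comparing such a simulation with the analytical results.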