The optimization for recurring queries in big data analysis system with MapReduce

Accepted Manuscript

Bin Zhang, Xiaoyang Wang, Zhigao Zheng

PII: S0167-739X(17)31020-8
DOI: https://doi.org/10.1016/j.future.2017.09.063
Reference: FUTURE 3719

To appear in: Future Generation Computer Systems

Received date: 15 May 2017
Revised date: 4 September 2017
Accepted date: 24 September 2017

Please cite this article as: B. Zhang, X. Wang, Z. Zheng, The optimization for recurring queries in big data analysis system with MapReduce, Future Generation Computer Systems (2017), https://doi.org/10.1016/j.future.2017.09.063

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

The Optimization for Recurring Queries in Big Data Analysis System with MapReduce

Bin Zhang1,2, Xiaoyang Wang2, Zhigao Zheng3*
1. Zhejiang University of Finance & Economics, Hangzhou 310018, China
2. School of Computer Science and Technology, Fudan University, Shanghai 201203, China
3. School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
*Corresponding author. Email: [email protected]

Abstract
As data-intensive cluster computing systems like MapReduce grow in popularity, there is a strong need to improve their efficiency. Recurring queries, which are executed repeatedly over long periods of time on rapidly evolving data-intensive workloads, have become a bedrock component of big data analytic applications. This paper therefore presents optimization strategies for recurring queries in big data analysis. Firstly, it analyzes the factors that affect recurring query efficiency through a MapReduce recurring queries model. Secondly, it proposes the MapReduce consistent window slice algorithm, which not only creates more reuse opportunities for recurring queries, but also greatly reduces redundant data while loading input data, through fine-grained scheduling. Thirdly, for data scheduling, it designs the MapReduce late scheduling strategy, which improves data processing and optimizes computation resource scheduling in a MapReduce cluster. Finally, it constructs efficient data reuse execution plans through the MapReduce recurring queries reuse strategy. Experimental results on a variety of workloads show that the algorithms outperform state-of-the-art approaches.

Keywords: big data; recurring queries; MapReduce; data reuse; local schedule

1. Introduction
With the rapid development of information technology, data of all kinds across human society is growing exponentially. For data-intensive analysis needs such as online advertisement, log processing, and network intrusion detection, traditional information processing and computing technology struggles to cope in the big data environment. In recent years, the ability to effectively process such huge amounts of data has become a key factor in driving business decisions. MapReduce has recently emerged as a new paradigm for large-scale data analytics due to its high scalability, fault tolerance, and flexible programming model. Many famous companies such as Google, Amazon, Facebook, Taobao, and many others have embraced MapReduce [1], and its open-source implementation Hadoop with YARN, to perform large-scale analytic applications on evolving big data.


In complex data analysis applications, such as log processing at Internet companies, news updates, and abstract promotion in social networking services, recurring queries [2] often appear. These systems periodically update massive data and must support fast, near real-time query processing. The same query analysis must be performed periodically over changing data; the query's value depends on the granularity of data the user is interested in, and a recurring query may be re-executed periodically for hours, days, or even months. The temporal characteristics of recurring queries in real environments place new requirements on existing computing models. Under the big data environment, recurring queries are characterized by massive data volume, high velocity, a variety of data types, and low value density. They must therefore tackle the tension between data-intensive query workloads and users' increasing dependence on scalable real-time processing. Recurring queries have thus recently become a focus of big data research. Traditional databases often use query reuse to improve query efficiency. Data reuse technology makes full use of the relationships between queries, reduces the amount of system storage, and shortens user response time, and is an important research topic in database management. However, because a traditional database lacks scalability and can rely only on a single server, its computing resources are quickly exhausted in big data analysis and it cannot sustain frequent query workloads. Effective measures are urgently needed, such as introducing the MapReduce [2] distributed parallel computing framework to build a scalable database.

2. Related work
2.1. Parallel database research
The traditional approach to large data processing is to use a parallel database system. A parallel database system is a high-performance database system established on the foundation of a massively parallel processing system and a cluster parallel computing environment. Such a system is composed of many loosely coupled processing units, where "processing unit" refers to more than just the processor: each unit has its own private CPU and resources such as bus, memory, and hard disk, and runs its own copy of the operating system and database management system. In this structure, resources are largely unshared. Research in this field started early abroad; by the 1980s database systems were already running on clusters of share-nothing nodes. These systems supported standard relational tables and SQL, while keeping the fact that data is actually stored on multiple machines transparent to the end user. Many of these systems are based on the pioneering research of the Gamma [3] and Grace [4] parallel DBMS projects. In the late 1980s, parallel database research gradually turned to general-purpose parallel machines, and the key research points became the physical organization, operator algorithms, optimization, and scheduling of parallel databases. As early as the mid-1980s, the Teradata and Gamma projects began to explore a new parallel database architecture: a share-nothing cluster of nodes, each with its own CPU, main memory, and disk, connected by a high-speed interconnect, as shown in Figure 1.


Figure 1: The parallel database architecture.

2.2. MapReduce model
In 2004, after studying the data storage and parallel processing of the web, Google researchers proposed the MapReduce computing model [5]. When dealing with large data, MapReduce only needs to run on ordinary computers; unlike a parallel database system it does not need high-end servers, and is therefore cost-effective. In 2008, Apache launched the Hadoop project based on the MapReduce framework. In October 2013, the second-generation Hadoop, namely the YARN [6] (Yet Another Resource Negotiator) framework, was released, providing a new and efficient model and method for big data analysis; in particular, its strong flexibility provides a rich interface for the scheduling and reuse strategies implemented on Hadoop. The job tracker is responsible for splitting a job into several small tasks, assigning each task to nodes in the Hadoop cluster, and tracking each computing node's tasks in real time. Hadoop task scheduling uses the FIFO scheduling strategy. The literature [7] proposes an algorithm for data stream segmentation in MapReduce based on the pane window method. That algorithm segments all data into equal sizes. However, this can only properly divide the input data source for a frequent query; query optimization for data whose features change between frequent queries is not discussed in that paper.

2.3. Query Reuse
As an important method to improve the performance of database queries, query reuse technology [8] is a hot research topic in the database field. Vertica [9] calculated the cost of two kinds of strategies using a materialization balance model, and then chose the low-cost materialization strategy. MonetDB [10] proposed a self-organizing tuple reconstruction strategy based on the Cracker Map. The defect of this approach is that the maintenance cost of the Cracker Map is too high when processing large data, which seriously affects query efficiency. In addition, the literature [13] proposed MRShare, a MapReduce-based algorithm that achieves shared scans and sharing of the Map output stage and the Map process, with a certain degree of innovation. However, its defect is that sharing applies only to single-table queries, and the algorithm must analyze all query tasks in advance to find the task groups that meet the sharing condition, at a high cost in system resources. In the literature [11], dynamic adjustment of the materialized set is carried out on top of the static algorithm, but the adjustment strategy is too complicated and the efficiency is not ideal. The ReStore [12] system manages the storage and reuse of the intermediate results generated by MapReduce tasks. It materializes Map and Reduce outputs as part of reuse management, and then lets subsequent tasks determine whether to reuse them, thus avoiding redundant task scheduling.
In summary, recurring queries have not yet been thoroughly studied, especially regarding how, in the distributed MapReduce environment, to conquer the increasing complexity of data-intensive workloads.

3. Framework
This section first introduces the recurring queries model from the literature [2]. Then, according to the characteristics of the distributed parallel MapReduce cluster environment, the models of MapReduce consistent window slice, MapReduce late scheduling strategy, and MapReduce recurring queries reuse are gradually derived from the recurring queries optimization model.

3.1. Recurring Query model
In a MapReduce parallel computing environment, recurring queries are batch query tasks executed over disk-resident data sets according to a specified time period. The differences among the OLTP, OLAP, and recurring query environments can be clearly seen from Table 1.

                        OLAP               OLTP                    Recurring Query
Query characteristics   ad-hoc queries     long-lived continuous   periodic queries
                                           queries
Data processing         batch processing   real-time processing    batch incremental processing
Computing hardware      disk-based         disk-based              memory-based
Volume                  big                large                   small

Table 1: The comparison among OLTP, OLAP and Recurring Query

The basic definition of the recurring queries model is given below:
Definition 1. Suppose there is a user query task set $\Phi(Q) = \{q_1, q_2, \ldots, q_n\}$; each recurring query is defined as a tuple $q_i = (\omega_i, \sigma_i)$, where $\omega$ represents the amount of data to process (the window size) and $\sigma$ is the query execution frequency. For example, $q_i = (\omega_i, \sigma_i) = (40, 20)$ specifies that every 20 minutes a processing task is executed which collects the data within the last 40 minutes.
Query processing time is the key factor that affects the efficiency of recurring queries. For convenience of analysis, we assume that the tuple sets contained in the batch files do not overlap. Thus, suppose there exists a set of input files containing all the tuples of the query, $\Pi(F) = \{f_1, f_2, \ldots, f_k\}$, with $[T(f_0), T(f_1)], [T(f_1), T(f_2)], \ldots, [T(f_{k-1}), T(f_k)]$ the corresponding time intervals, where $\{(T(q_i), T(q_j)) \mid q_i, q_j \in \Phi(Q),\ T(q_i) < T(q_j)\}$ satisfies the time condition. Each $f_i$ follows the time sequence, but there is no order constraint between the tuples within a file. Consider two consecutive batch execution tasks $E_i$ and $E_j$ with corresponding execution times $T_i$ and $T_j$, where $T_i < T_j$. The input data of this system arrives as a number of batches in HDFS file form, with the time series $\{T_n(f_1), T_n(f_2), \ldots, T_n(f_k)\}$ corresponding to the file sequence $\{f_1, f_2, \ldots, f_k\}$, where $T_i \le T_n(f_1), T_n(f_2), \ldots, T_n(f_k) \le T_j$.
This model represents recurring queries in general big data applications and matches the characteristics of most real analysis systems. For example, in a log query processing system, a process runs every hour over the latest log files collected from a cluster of machines, and the batch files uploaded to HDFS become the input of a new query task.
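As a concrete illustration of the tuple $q_i = (\omega_i, \sigma_i)$, the following sketch (our own hypothetical helper names, not part of the paper's system) enumerates the data windows that the example query $q_i = (40, 20)$ would process:

```python
from dataclasses import dataclass

@dataclass
class RecurringQuery:
    omega: int  # window size: minutes of data to process per execution
    sigma: int  # execution frequency in minutes

def execution_windows(q: RecurringQuery, horizon: int):
    """Yield the (start, end) data window of each periodic execution.

    At every multiple of sigma, the query processes the data collected
    within the last omega minutes, so consecutive windows may overlap.
    """
    t = q.sigma
    while t <= horizon:
        yield (max(0, t - q.omega), t)
        t += q.sigma

# Example from the text: q_i = (40, 20) -- every 20 minutes, process
# the data of the last 40 minutes.
q = RecurringQuery(omega=40, sigma=20)
windows = list(execution_windows(q, horizon=80))
# windows == [(0, 20), (0, 40), (20, 60), (40, 80)]
```

The overlap between consecutive windows (e.g. minutes 20–40 appear in both the second and third window) is exactly the redundant loading that the slicing strategy of Section 3.2 targets.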

3.2. MapReduce consistent window slice model
This section analyzes how recurring queries on MapReduce can slice the input data source effectively to reduce disk I/O operations, and thereby reduce the cost of the whole query task. We define the slice window model as follows:
Definition 2. Let a window $W = \{W \mid S(W) = \omega\}$ be decomposed into a set of $n$ slices $\Pi(S) = \{s_1, s_2, \ldots, s_n \mid Start(s_1) = 0,\ End(s_i) = Start(s_{i+1})\}$. The sizes of the slices $s_i$ form the collection:

$\Pi(R) = \{r_1, r_2, \ldots, r_n \mid 1 \le i \le n,\ r_i = End(s_i) - Start(s_i)\}$   (1)

3.3. MapReduce late scheduling strategy model
Given the same computation resources, data and query tasks are the decisive factors that affect the performance of MapReduce cluster computing. Network bandwidth in a MapReduce cluster is much smaller than the bandwidth within cluster nodes, and network transmission delay is a fatal flaw. Scientific and efficient parallel scheduling can greatly reduce the network I/O of the cluster. Computation is fastest and most efficient when all the data required by the tasks of a job has already been scheduled onto the node, so that no data needs to be transmitted through the network. However, through our survey of several existing Internet cloud computing platforms, we found that reality is often unsatisfactory: in most cases it is already considered good enough if a job runs within the same machine rack. Accordingly, this is why the MapReduce proximal scheduling model and the MapReduce late scheduling strategy are proposed in this paper. The MapReduce proximal node task is defined as follows:
Definition 3. Proximal node task. Let $\Phi(Q) = \{q_1, q_2, \ldots, q_n\}$ be a set of recurring queries and $\Theta = \{\theta_1, \theta_2, \ldots, \theta_x\}$ be the MapReduce node set. The query input file needed by the recurring query $q_i$ is denoted by $Input(q_i)$, and the HDFS data of the node $\theta_x$ is denoted by $HDFS(\theta_x)$. If $\{\exists \theta_x\ \exists q_i \in \Phi(Q) \mid Input(q_i) \subseteq HDFS(\theta_x)\}$, the task of the recurring query $q_i$ is a proximal node task.
Suppose the number of MapReduce datanodes in the computing cluster is denoted by $M$ and each node's CPU core count by $H$; then the total computing power can be simply expressed as $S = M \times H$. Let $\Phi(Q) = \{q_1, q_2, \ldots, q_n\}$ be the set of recurring query tasks, where $q_i$ is a proximal node task on $Node_k$. To simplify the analysis, let all tasks be performed with the same length of time, $\{q_i \mid q_i = T\}$, where any two tasks are independent, denoted by $\{q_i, q_j \in \Phi(Q) \mid \rho_{q_i, q_j} = 0\}$.

3.4. MapReduce recurring queries reuse model
The basic definitions of the MapReduce reuse group and the MapReduce execution order are as follows:
Definition 4. Reuse Group. Given a set of recurring query tasks $\Phi(Q) = \{q_1, q_2, \ldots, q_n\}$, the corresponding query batch set is $\Phi(G) = \{G_1, G_2, \ldots, G_n \mid G_i \subseteq \Phi(Q)\}$, where each subset $G_i$ of $\Phi(Q)$ is called a reuse group. Each $G_i$ satisfies the following two conditions:
(1) $G_i \cap G_j = \emptyset$ for $\{i, j \mid 1 \le i, j \le k,\ i \ne j\}$, where $i$ and $j$ are in the time sequence $\{1, \ldots, k\}$ (the groups are disjoint);
(2) $\bigcup_{i=1}^{n} G_i = \Phi(Q)$, i.e. all of the $G_i$ together form the recurring query set $\Phi(Q)$.
Definition 5. Execution order. In the reuse group RG, the time at which $G_i$ starts execution is denoted by $T_i(begin)$, and its finish time by $T_i(end)$. By default, each query group uses all the available computation resources of the MapReduce cluster in order to accomplish its task as early as possible. Therefore, an efficient execution order over $\Phi(Q)$ is a sequence of reuse groups, denoted by
$\Phi(Q) = G_1 \to G_2 \to \cdots \to G_i \to G_j \to \cdots \to G_n$, which is a sequence of RG.

4. MapReduce consistent window slice algorithm
Through analysis of existing MapReduce scheduling strategies and the features of recurring queries, we design the MapReduce consistent window slice algorithm (MCWSA) based on the MapReduce consistent window slice model, and present the resulting optimization of recurring queries. The basic idea of the MCWSA is to partition the data source, as far as possible, into slices whose constraint windows belong to multiple recurring queries. Given a recurring query set $\Phi(Q) = \{q_1, q_2, \ldots, q_n\}$ over the same input data source, where $\{q_i, q_j \in \Phi(Q),\ i \ne j \mid \omega(q_i) \ne \omega(q_j)\ \&\ \sigma(q_i) \ne \sigma(q_j)\ \&\ \upsilon(q_i) \ne \upsilon(q_j)\}$, suppose for simplification that all queries have the same start time, $\{q_i, q_j \in \Phi(Q) \mid \upsilon_i = \upsilon_j,\ i \ne j\}$. Through mathematical analysis of recurring queries $q_i$ with different $\omega$ and $\sigma$, we found that a share window $\omega'$ covering consecutive recurring queries $q_i$ and $q_j$ is the most efficient window slice strategy. The MCWSA chooses the least common multiple of $\omega_i$ and $\omega_j$ as the share window $\omega'$, because the query space loss is minimal. The formula is expressed as:

$\{q_i(\omega_i, \sigma_i),\ q_j(\omega_j, \sigma_j) \in \Phi(Q) \mid \omega'(q_i, q_j) = \mathrm{LCM}(\omega_i, \omega_j)\}$   (2)

For example, if $\omega_i$ and $\omega_j$ are 24 and 36 respectively, their share window $\omega'$ is 72. The MCWSA calculates the size of the window $\omega'$ and partitions the window into unequal sizes according to $\omega$ and $\sigma$. Let the slice window of the recurring queries be the share window $\omega'$; it is then divided into a pair of slices $s_i$ and $s_j$, where $\{s_i = \omega' \bmod \sigma_i,\ s_j = \omega' - s_i\}$. Plainly, the MCWSA partitions a window into only two slices, whereas the original algorithm [12] uses the average partition method to split the window into several pieces, which produces many small files in HDFS. As is well known, the MapReduce framework is in general inefficient at computing over very small files, so our algorithm is better than that algorithm.
Next, we discuss the general case in which the start time $\upsilon$ differs and the system allows an arbitrary logical start time. Let a triple $q_i = (\omega_i, \sigma_i, \upsilon_i)$ denote a recurring query, where the start time of $q_i$ is $\upsilon_i$. The MCWSA sorts the $q_i$ by $\upsilon_i$ in ascending order. Then it calculates the maximum deviation of $\upsilon_i$, $\Delta(start) = \mathrm{Max}(|\upsilon_i - \upsilon_j|)$, and modifies the share window as $\{\omega_i = \omega_i + (\upsilon_i - \Delta(start))\}$. It can be seen that the MCWSA algorithm builds a share slice window covering all the recurring queries $\Phi(Q)$ and divides it into an unequal set of slices $S(Q)$. The algorithm is described as follows:

Algorithm 1. MCWSA.
INPUT: Φ(Q); OUTPUT: S(Q).
1  S(Q) = ∅
2  for qj in Φ(Q) do
3    for qi in Φ(Q) do
4      Δ(start) = Max(|υi − υj|)
5      if Δ(start) > 0 then
6        ωi = ωi + (υi − Δ(start))
7      end if
8      ω′ = LCM(ωi, ωj)
9      si = ω′ mod σi, sj = ω′ − si
10     S(Q).add(si, sj)
11   end for
12 return S(Q)
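The core of the pairwise step (lines 8–9 of Algorithm 1) can be sketched as follows; the σ value in the example is hypothetical, since the paper's example only fixes ωi = 24 and ωj = 36:

```python
from math import gcd

def lcm(a, b):
    """Least common multiple, used as the share window of formula (2)."""
    return a * b // gcd(a, b)

def mcwsa_pair(omega_i, sigma_i, omega_j):
    """Sketch of the MCWSA pairwise step: the share window omega' is
    LCM(omega_i, omega_j), and it is split into a single pair of slices
    (omega' mod sigma_i, remainder) rather than many equal pieces, which
    avoids creating many small files in HDFS."""
    shared = lcm(omega_i, omega_j)   # share window omega' (formula 2)
    s_i = shared % sigma_i           # first slice
    s_j = shared - s_i               # second slice covers the rest
    return shared, (s_i, s_j)

# Example from the text: omega_i = 24 and omega_j = 36 give omega' = 72.
# sigma_i = 20 is an illustrative value not given in the paper.
shared, slices = mcwsa_pair(24, 20, 36)
# shared == 72, and the two slices always sum back to the share window.
```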

The paired windows for qA and qB are (3,1) and (3,3), respectively. Then the window W = 12 can be partitioned into the sliced window W = (3,1,2,1,1,1,2,1), which can satisfy both queries as shared executions, as shown in Figure 2.

Figure 2: The window slice of the MCWSA.
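The slice sequence above can be reproduced by merging the cut points that each query's (si, sj) pair induces on the share window; this is a sketch of the construction behind Figure 2 (the function name is ours, not the paper's):

```python
def consistent_slices(pairs, window):
    """Merge per-query slice boundaries into one consistent slice set:
    each query contributes cut points by repeating its (s_i, s_j) pair
    across the share window, and the union of all cut points defines
    the shared slices."""
    cuts = set()
    for s_i, s_j in pairs:
        t = 0
        while t < window:
            cuts.add(t + s_i)
            cuts.add(t + s_i + s_j)
            t += s_i + s_j
    bounds = sorted(c for c in cuts if 0 < c <= window)
    slices, prev = [], 0
    for b in bounds:
        slices.append(b - prev)
        prev = b
    return slices

# The example from the text: paired windows (3,1) for qA and (3,3) for qB
# over the share window W = 12.
result = consistent_slices([(3, 1), (3, 3)], 12)
# result == [3, 1, 2, 1, 1, 1, 2, 1], matching the slice sequence above
```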

Now let three recurring queries qA, qB and qC be scheduled, with the corresponding triples (ωA = 9, σA = 4, υA = 0), (ωB = 9, σB = 6, υB = 2) and (ωC = 5, σC = 2, υC = 1). The MCWSA algorithm calculates the share window and divides the recurring queries into slices as shown in Figure 3.

Figure 3: The timeline of the MCWSA.

Obviously, the MCWSA algorithm not only creates more opportunities for reuse, but also, through fine-grained scheduling of the input data processing, greatly reduces redundant data uploads, effectively avoids repeated data computation, and achieves the optimization of system workloads.

5. MapReduce late scheduling strategy
To improve the efficiency of data scheduling in data-intensive environments, we propose the MLSS (MapReduce late scheduling strategy) in this paper.
A common problem is that a task may never be scheduled under the proximal node condition because of the strict sort order. To avoid this, we design the MapReduce late scheduling strategy as follows. Firstly, when a node requests a task, if the $q_i$ at the front of the queue cannot meet the condition above and cannot obtain the resources of this node to start the task $M(q_i)$, the MLSS skips $q_i$ and processes the tasks $q_j$, where $\{q_j \mid i < j \le n\}$. Secondly, if the interval for which the job has been skipped exceeds the threshold $\tau$ of the MLSS system, where $\{M(q_i) \mid (T_i - T_k) \ge \tau,\ i < k \le n\}$, MLSS wakes $q_i$ up and processes the task $M(q_i)$. Lastly, MLSS avoids the Barrel Effect that would affect the efficiency of the entire job. MLSS is founded on the assumption that query tasks are accomplished very fast on proximal nodes: a node completes the calculation of a task in a few seconds and releases the computation resource. The number of times MLSS may skip a task is denoted by $L$. It should be noted that once a job has been skipped $L$ times, the MLSS allows it to start on a number of proximal nodes at random, without having to reset $L$. MLSS is described as follows:

Algorithm 2. MLSS.
INPUT: Φ(Q); OUTPUT: M(qi)


1  for qi in Φ(Q) do
2    qi.scount = 0
3  while Noden != offline
4    if Noden = ∅ then
5      Sort qi
6      for qi in Φ(Q) do
7        if qi(task) != null && (qi(task).data in Noden) then
8          qi(task) → Noden
9          qi.scount = 0
10       else if qi(task) != null then
11         if qi.scount ≥ L then
12           qi(task) → Noden
13         else
14           qi.scount++
15         end if
16       end if
17     end for
18   end if
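The skip-and-wake decision of Algorithm 2 can be sketched as a single-node view (hypothetical names; the full strategy also involves the time threshold τ, omitted here):

```python
def mlss_pick_task(queue, node_blocks, skip_count, L):
    """Sketch of the MLSS decision: when a node asks for work, prefer a
    task whose input data is local to the node; a task may be skipped
    at most L times before it is launched anyway (non-proximally).

    queue:       ordered list of (task_id, input_block) pairs
    node_blocks: set of HDFS blocks stored on the requesting node
    skip_count:  dict task_id -> times the task has been skipped
    """
    for task_id, block in queue:
        if block in node_blocks:            # proximal (data-local) task
            skip_count[task_id] = 0
            return task_id
        if skip_count.get(task_id, 0) >= L:
            return task_id                  # waited long enough: run remotely
        skip_count[task_id] = skip_count.get(task_id, 0) + 1
    return None

# A task with remote data is skipped until its counter reaches L = 2,
# after which it is dispatched even without data locality.
skips = {}
queue = [("q1", "blockA")]
picks = [mlss_pick_task(queue, {"blockB"}, skips, L=2) for _ in range(4)]
# picks == [None, None, "q1", "q1"]
```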

Next, we analyze in detail how the parameter $L$ of MLSS affects data scheduling and communication costs, and how the target of efficient data scheduling is achieved. Given a recurring query task $q_i$, let $E(q_i)$ denote the mathematical expectation of the probability that a node is idle. If a non-proximal node task $q_i$ is wakened from waiting when the threshold $\tau$ is reached, the probability of $q_i$ finding a proximal node task is $1-(1-E(q_i))^L$. Obviously the probability increases exponentially with $L$. For example, if 20% of the nodes are performing a calculation task, i.e. $E(q_i) = 0.2$, then with $L = 10$ the task achieves proximal node scheduling with probability 89.2%, and with $L = 30$ the probability exceeds 99%. This fully proves the increasing relationship, which theoretically establishes the validity of the MLSS for query optimization.
Secondly, we analyze the timing of MLSS. Suppose the current MapReduce cluster computation resource is $S = M \times H$; then the mathematical expectation of the time for a CPU core to become free is $E_{free}(slot_j) = T/S$. Therefore, once a recurring query task $q_i$ reaches the front of the queue, it waits $W(q_i) = D \cdot E_{free}(slot_j) = DT/S$ before starting a proximal node task. If $S$ is sufficiently large, the waiting time $W(q_i)$ will be far less than the average task execution length $T(q_i)$, i.e. $\lim (W(q_i)/T(q_i)) = 0$. Waiting for a proximal node may be cheaper than running on a non-proximal node, and this rule is demonstrated in Section 7.2. In addition, when the number of nodes is fixed, $W(q_i)$ decreases linearly with $H$.
The objective of the optimization of the MLSS algorithm is to analyze how to set $L$ to achieve the optimal rate of proximal node tasks. The analysis consists of the following steps. First of all, suppose the proximal node task rate the algorithm hopes to achieve is greater than $\lambda$, the Hadoop replication factor is $R$, and the number of waiting tasks in the cluster is $N$. Next, consider the $N$ tasks over their average life cycle $T_n$; when a proximal node task starts, the set of remaining waiting recurring query tasks is $\{t_K, t_{K-1}, \ldots, t_1 \mid t_K \in q_i\}$. The mathematical expectation of $q_i$ task scheduling is $E(q_i) = 1-(1-K/M)^R$, because the probability that a given node does not contain the input file of $q_i$ is $(1-K/M)^R$. Thus, the probability of starting a proximal node task on $Node_k$ for $q_i$ is:

$1-(1-E(q_i))^L = 1-(1-K/M)^{RL} \ge 1-e^{-RLK/M}, \quad \{1 \le K \le N\}$   (3)

The value of $K$ averages between 1 and $N$, so the mathematical expectation of the proximal node task scheduling rate of $q_i$ can be deduced as follows:

$L(Q) = \frac{\sum_{K=1}^{N} \left(1-e^{-RLK/M}\right)}{N} = 1-\frac{\sum_{K=1}^{N} e^{-RLK/M}}{N}$   (4)

Taking the limit as $N$ approaches infinity, the geometric sum converges and:

$L(Q) = 1-\frac{e^{-RL/M}}{N\left(1-e^{-RL/M}\right)}$   (5)

If $L(Q) \ge \lambda$, solving for $L$ gives:

$L = \frac{M}{R}\ln\left(\frac{1+(1-\lambda)N}{(1-\lambda)N}\right)$   (6)

In summary, the conclusions drawn for MLSS are: a) the number of non-proximal node tasks decreases exponentially with $L$; b) the time required to achieve a proximal node task is much smaller than the average time of the given task; c) $L$ shows a significant correlation with $N$.
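A quick numerical sketch (parameter values are illustrative, and the sign placement in formula (6) is inferred from the derivation) confirms that plugging the $L$ from formula (6) back into formula (5) recovers the target rate $\lambda$:

```python
from math import exp, log

def mlss_skip_bound(M, R, N, lam):
    """Solve formula (6) for the skip bound L: the L at which the
    expected proximal-task rate L(Q) reaches lambda."""
    return (M / R) * log((1 + (1 - lam) * N) / ((1 - lam) * N))

def proximal_rate(M, R, N, L):
    """Expected proximal-task rate, large-N form of formula (5)."""
    x = exp(-R * L / M)
    return 1 - x / (N * (1 - x))

# Illustrative cluster: M = 50 datanodes, replication R = 3,
# N = 100 waiting tasks, target rate lambda = 0.95.
M, R, N, lam = 50, 3, 100, 0.95
L = mlss_skip_bound(M, R, N, lam)
# proximal_rate(M, R, N, L) == lam, up to floating-point error
```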

6. MapReduce recurring queries reuse strategy
We present the MRQRS (MapReduce recurring queries reuse strategy) in order to improve the efficiency of recurring queries. The difficulty of data reuse in MRQRS lies in the mutual dependencies among recurring queries. Therefore, our solution is to construct efficient MapReduce reuse groups (RG). On the one hand, MRQRS scans $\Phi(Q) = \{q_1, q_2, \ldots, q_n\}$ and gives priority to the set of RGs among the recurring queries when scheduling execution; on the other hand, MRQRS determines the execution sequence of the RGs. The RG is used as a criterion to evaluate candidates and to eliminate non-optimal ones as early as possible, so as to achieve fast convergence of the search space. MRQRS first determines the RGs from a given $\Phi(Q)$, where each group uses the formula described in Definition 4 to calculate its weight $\Omega(G_i)$.
MRQRS solves the permutation and combination problem of recurring queries. It first establishes a network search space, in which each node simulates a possible RG. Moreover, MRQRS calculates all possible execution sequences at each node. A traditional simple search algorithm traversing the entire space is almost impossible, especially in the current big data environment. MRQRS efficiently eliminates non-optimal candidate sets by using the execution sequence and the RG as evaluation criteria, which effectively reduces this complex search space. In order to determine RGs quickly, MRQRS calculates the cost of an RG, $C(RG)$, expressed as follows:

$C(RG) = \frac{\Delta(t_{share}) \cdot C_{share}(G_i)}{\sum_{k=1}^{|G_i|} \Delta(t_k) \cdot \sum_{k=1}^{|G_i|} C(q_k)}$   (7)

where $C_{share}(G_i)$ denotes the cost of executing the reuse of $G_i$, the sum $\sum_{k=1}^{|G_i|} C(q_k)$ calculates the total cost of each query in the RG $G_i$, $\Delta(t_{share})$ denotes the cost of calculating the RG, and $\sum_{k=1}^{|G_i|} \Delta(t_k)$ denotes the total cost of those queries without MRQRS.
From the analysis of formula 7, $C(RG)$ shows a significant correlation with the query optimization benefit, so groups with a better $C(RG)$ should naturally be given higher execution priority. Therefore, the next step is to form all possible reuse groups by grouping queries based on the calculation of each $C(RG)$; MRQRS chooses the RG with the best $C(RG)$ as the execution priority. MRQRS is built on a greedy search strategy. MRQRS, which calculates the optimal RG of the recurring queries $\Phi(Q)$, is described as follows:

Algorithm 3. MRQRS.
INPUT: Φ(Q); OUTPUT: RG
1  GS = ∅
2  for Gi in Φ(G) do
3    if C(RG) < 0 then
4      Φ(G).del(Gi)
5    else if !(Gi in GS) then
6      GS.add(Gi)
7    end if
8  end for
9  GS.sort
10 return RG
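A minimal sketch of the cost-driven grouping follows (our own helper names; we assume a group is worth keeping when its cost ratio is below 1, i.e. the shared plan beats running the queries separately, and that the operator structure of formula (7) is the product form shown above):

```python
def reuse_cost_ratio(shared_overhead, shared_cost, solo_overheads, solo_costs):
    """Interpretation of formula (7): compare the shared execution of a
    reuse group (one scan serving the whole group) against running every
    query separately. A ratio below 1 means sharing pays off."""
    return (shared_overhead * shared_cost) / (sum(solo_overheads) * sum(solo_costs))

def mrqrs_order(groups):
    """Greedy MRQRS step: drop groups whose sharing never helps and
    execute the remaining groups cheapest-ratio first.

    groups: list of (group_id, cost_ratio) pairs
    """
    kept = [(g, c) for g, c in groups if c < 1.0]  # sharing must beat solo runs
    kept.sort(key=lambda gc: gc[1])                # best C(RG) gets priority
    return [g for g, _ in kept]

# Hypothetical candidate groups with precomputed cost ratios.
groups = [("G1", 0.4), ("G2", 1.3), ("G3", 0.2)]
order = mrqrs_order(groups)
# order == ["G3", "G1"]  -- G2 is pruned because sharing costs more
```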

7. Experimental evaluation
All experiments use the HRQMS (Hadoop Recurring Query Management System), developed by our research group, as the evaluation platform, and the international SSB [9] dataset to verify the proposed algorithms.

7.1 Experimental Setup
HRQMS is developed in Java 1.7, extending the YARN interface. The experimental system runs on a 50-node cluster. Each node has four 3.1 GHz CPU cores, 4 GB memory, and a 500 GB SATA hard disk; the operating system is Redhat Linux 6.5, and the network environment is a 1 Gbps Ethernet LAN. The experiments used the YARN MapReduce software (Hadoop-2.3.0), compared with the big data analysis software Hive [18] (version 0.14.0, supporting the latest YARN) and with the traditional data warehouse DWMS3.0, our previous research on column-store databases.
Experiments evaluate our algorithms using SSB's 13 benchmark statements as the recurring queries $\Phi(Q_0) = \{q_1, q_2, \ldots, q_{13}\}$. The HDFS data block size is configured as 256 MB, and the Hadoop query actuator's global memory size was set to 1 GB. The Hadoop replication factor was set to its default value. Experiments record each data set in the Hadoop log, along with query execution time and frequency, and the cluster HDFS is re-formatted after each experiment.

7.2 Experimental results and analysis
Experiment 1. Performance comparison of simple recurring queries.
The experiment selects Q1.1 with a join operation, Q2.1 with simple aggregation, and Q3.1 and Q4.1 with complex aggregation from SSB as $\Phi(Q)$. Each query recurs 20 times as $\sigma$ in the recurring queries, on 50 nodes with 100 GB data sets per recurrence, and the average execution time is recorded.

Figure 4: The comparison of different queries.

The results in Figure 4 clearly illustrate the improvement of HRQMS over Hive and DWMS, especially on the complex aggregation tasks Q3.1 and Q4.1. HRQMS outperforms Hive by 23% in average execution time and DWMS by 42%, respectively.
Experiment 2. Effectiveness of the slice window strategy.
In this slice window experiment, we also use 20 queries, 50 nodes, and 100 GB data sets to test the optimization effect of the MCWSA. The cluster distribution factor called tuple overlap is the proportion of tuple data overlapped between two consecutive recurring queries.


Figure 5: The comparison in different ratios of overlap.

As shown in Figure 5, the execution time of HRQMS is 46% less than that of Hive when the overlap rate is 50%, and 57% less when the overlap rate is 90%. This indicates that MCWSA not only creates more reuse opportunities for recurring queries, but also subjects the input data to fine-grained scheduling. It can thus greatly reduce redundant data loading and effectively avoid repeated data calculation, which significantly improves the response time of recurring queries.

Figure 6: The comparison in different workloads with the same overlap rate.

To verify the capability of the system under high data load, 10 trials were performed under all node conditions at each of the following sizes: 10GB, 100GB, and 1TB of data. Fig. 6 shows the average execution time for each workload with the same overlap rate of 50%. HRQMS has a gentler upward trend, and when the load is increased to 1TB, the execution time of HRQMS decreases by about 26.2% compared with Hive. This result fully demonstrates the superior load capacity of HRQMS.

Experiment 3. Effectiveness of MapReduce late scheduling. To examine the performance improvement of MLSS on small-file workloads, the number of Map tasks was set to 10, 20, 30, 40, and 50. For each recurring query load, the experiment chose tasks similar in size to Q3.1. The experiments set the scheduling wait threshold to 15s for MLSS, compared with Hadoop's FIFO scheduler. The results in average running time for different numbers of Map tasks, depicted in Figure 7, prove the success of the MLSS approach as well. The jobs with the greatest improvement run 7%, 11%, 19%, 26%, and 31% faster, respectively, under MLSS than under FIFO.
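The late-scheduling idea measured here can be sketched as a simple wait-then-fall-back rule. This is a minimal illustration under our own assumptions, not the actual MLSS implementation: a task whose input data is local to a busy node waits up to a threshold before being launched on a non-local node.

```python
def schedule_task(local_node, free_nodes, waited_s, threshold_s=15.0):
    """Return the node to launch the task on, or None to keep waiting.

    local_node  -- node holding the task's input data locally
    free_nodes  -- nodes with a free slot right now
    waited_s    -- how long the task has already waited (seconds)
    threshold_s -- wait threshold (15s in the experiments above)
    """
    if local_node in free_nodes:
        return local_node        # data-local launch, no network transfer
    if waited_s >= threshold_s and free_nodes:
        return free_nodes[0]     # give up on locality after the threshold
    return None                  # keep waiting for a local slot
```

Under this rule a short wait often converts a remote read into a local one, which is where the running-time gains on small-file workloads would come from.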


Figure 7: The comparison in different numbers of Map tasks.

Experiment 4. Effectiveness of the MapReduce recurring queries reuse strategy. For the MapReduce recurring queries reuse experiment, we use 30 SSB query jobs, 50 nodes, and 100 GB data sets. For each query, we vary the rate of tuple overlap and evaluate the efficiency of the recurring queries processing.

Figure 8: The comparison of recurring queries reuse strategies.

Figure 8 shows that the HRQMS execution time decreases noticeably; especially when the overlap is 90%, the HRQMS query optimization is 43% better than Hive. Overall, we observe that HRQMS significantly improves the run-time of recurring queries when the reuse strategy is enabled.

In summary, the experiments evaluate execution time and workload capability. The HRQMS approach is guaranteed to produce an optimal solution for the execution of recurring workloads with negligible optimization time overhead.
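Why reuse pays off most at high overlap can be seen in a toy memoization sketch. This is our own simplification, not the HRQMS code: per-slice partial results are cached across recurrences, so each run only recomputes the slices that did not appear in the previous run.

```python
# Toy sketch: reuse cached per-slice partial results across recurrences.
cache = {}

def run_recurrence(slice_ids, aggregate):
    """aggregate(slice_id) computes one slice's partial result (expensive).
    Cached slices are reused; only slices absent from the cache are computed.
    Returns (results, number_of_slices_actually_computed)."""
    computed = 0
    results = {}
    for s in slice_ids:
        if s not in cache:
            cache[s] = aggregate(s)
            computed += 1
        results[s] = cache[s]
    return results, computed

# First run over slices 0..9 computes all 10; a second run over 1..10
# (90% overlap with the first) computes only the one new slice.
first, n1 = run_recurrence(range(10), lambda s: s * s)
second, n2 = run_recurrence(range(1, 11), lambda s: s * s)
```

At 90% overlap only one tenth of the per-slice work remains, which is consistent with the reuse strategy showing its largest gains at the highest overlap rates in Figure 8.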

8. Conclusion

This paper presented the first targeted optimization of recurring queries efficiency based on the MapReduce recurring queries model. Secondly, it proposed the MapReduce consistent window slice algorithm. Thirdly, in terms of data scheduling, it designed the MapReduce late scheduling strategy, which improves data transmission and optimizes computation resource scheduling in the MapReduce cluster. Finally, it constructed efficient data reuse execution plans with the MapReduce recurring queries reuse strategy. Our experimental results on a variety of workloads show that our proposed algorithms consistently outperform the state-of-the-art approaches by up to 50%. We believe the algorithms can be generalized: other distributed applications that must handle recurring queries or data scheduling under load can also benefit from our proposed approaches.

Acknowledgements

This work is supported by Zhejiang philosophy and social science planning project (Grant No.17NDJC179YB).

References

[1] Michael Stonebraker, Sam Madden, Pradeep Dubey, Intel "big data" science and technology center vision and execution plan, SIGMOD Record 42(1) (2013) 44-49.

[2] Chuan Lei, Zhongfang Zhuang, Elke A. Rundensteiner, Mohamed Eltabakh, Shared Execution of Recurring Workloads in MapReduce, PVLDB 8(7) (2015) 714-725.

[3] Nidhi Tiwari, Santonu Sarkar, Umesh Bellur, Maria Indrawan, Classification Framework of MapReduce Scheduling Algorithms, ACM Computing Surveys 47(3) (2015) 49.

[4] Daniel Reed, Jack Dongarra, Exascale computing and big data, Commun. ACM 58(7) (2015) 56-68.

[5] Hari Singh, Seema Bawa, A MapReduce-based scalable discovery and indexing of structured big data, Future Generation Comp. Syst. 73 (2017) 32-43.

[6] Jihoon Son, Hyoseok Ryu, Sungmin Yi, Yon Dohn Chung, SSFile: A novel column-store for efficient data analysis in Hadoop-based distributed systems, Information Sciences 316 (2015) 68-86.

[7] Ce Zhang, Arun Kumar, Christopher Ré, Materialization Optimizations for Feature Selection Workloads, ACM Trans. Database Syst. 41(1) (2016) 2:1-2:32.

[8] Valerie Barr, Michael Stonebraker, A valuable lesson, and whither Hadoop? Commun. ACM 58(1) (2015) 18-19.

[9] Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, Daniel Abadi, Fast Distributed Transactions and Strongly Consistent Replication for OLTP Database Systems, ACM Trans. Database Syst. 39(2) (2014) 11.

[10] Chulyun Kim, Kyuseok Shim, Supporting set-valued joins in NoSQL using MapReduce, Information Systems 49 (2015) 52-64.

[11] Tomasz Nykiel, Michalis Potamias, Chaitanya Mishra, George Kollios, Nick Koudas, Sharing across Multiple MapReduce Jobs, ACM Trans. Database Syst. 39(2) (2014) 12.

[12] Ahcène Boukorca, Ladjel Bellatreche, Sid-Ahmed Benali Senouci, Zoé Faget, Coupling Materialized View Selection to Multi Query Optimization: Hyper Graph Approach, International Journal of Data Warehousing and Mining 11(2) (2015) 62-84.
[13] Naila Karim, Khalid Latif, Zahid Anwar, Sharifullah Khan, Amir Hayat, Storage schema and ontology-independent SPARQL to HiveQL translation, The Journal of Supercomputing 71(7) (2015) 2694-2719.


Author Biography

Bin Zhang received his PhD degree in Computer Science from Donghua University, Shanghai, in 2017. He is currently a postdoctoral researcher in the School of Computer Science at Fudan University, Shanghai. His research interests include Database Theory, Data Mining, Parallel and Distributed Systems, Computer System Architecture, and Web-related Technologies.

Xiaoyang Sean Wang earned his PhD degree in Computer Science from the University of Southern California in 1992. Before that, he obtained his BS and MS degrees in Computer Science from Fudan University, Shanghai, China. He joined the School of Computer Science at Fudan University in 2011 as Dean and Professor. His research interests include Database Management Systems, Information Security and Privacy, Wireless Sensor Networks, Streaming Data Processing, Time Series Queries, Data Mining and Data Warehousing, and Database Theory. In recent years, his group has focused on the study of Big Data and has published more than 100 papers in important academic journals. He is a senior member of the ACM and the China Computer Federation (CCF).


Highlights
1) A new recurring queries model for MapReduce was proposed.
2) Data scheduling in recurring queries, powered by MapReduce late scheduling and consistent window slicing, was studied.
3) An efficient MapReduce reuse strategy for recurring queries was presented with the model.
4) Four different optimization algorithms for recurring queries were compared experimentally with traditional approaches.
