An efficient theta-join query processing in distributed environment

Wenjie Liu *, Zhanhuai Li

School of Computer, Northwestern Polytechnical University, Xi'an, Shaanxi 710072, China
Highlights

• Effective max and min values based filter strategy for theta-join computing in distributed environment.
• Divide and Merge method for theta-join which reduces network overheads greatly.
• Extensive experiments using real-world and synthetic data sets.
Article info

Article history: Received 12 March 2016; Received in revised form 3 April 2018; Accepted 5 July 2018

Keywords: Parallel distributed framework; Theta-join algorithm; Query optimization; Large scale data processing
Abstract

Theta-join queries are very useful in many data analysis tasks, but they are not processed efficiently in distributed environments, especially on large scale data. Although there has been much progress in handling theta-joins with the MapReduce paradigm, the existing methods are either complex, requiring fundamental changes to the MapReduce framework, or only consider the overhead of load balancing in the network; when the data scale is large, they incur a high computation cost and induce OOM (Out of Memory) errors. In this work, we propose a filter method for theta-join with the purpose of reducing the computation cost and achieving the minimum execution time in a distributed environment. We consider not only the load balance in the cluster, but also the memory cost in the parallel framework. We also propose a keys-based join solution for multi-way theta-join to reduce the data amount for the cross product and thus improve join efficiency. We implement our methods in a popular general-purpose data processing framework, Spark. The experimental results demonstrate that our methods significantly improve the performance of theta-joins compared with the state-of-the-art solutions. © 2018 Elsevier Inc. All rights reserved.
1. Introduction

Large scale data processing for analytic queries involves theta-join operations, which has become one of the most important challenges in recent years. A theta-join is defined by a comparison operator θ that belongs to {<, ≤, =, ≥, >, <>}. MapReduce is a prevalent framework for processing large scale data in parallel and can be used to process join operations. Due to its inherent key-equality nature, it can easily support equi-joins, but it cannot be directly used for theta-joins. Recently, a few works have focused on processing theta-joins in the MapReduce framework. 1-bucket-theta is a method proposed to evaluate a single theta-join in one MapReduce job [11]. Its main idea is to balance the workload among reducers. It partitions the cross-product results of the two input tables for the theta-join into rectangular regions of bounded size. The records in one region are distributed to one reducer; as each region includes almost the same amount of data, each reducer will receive
the average workload, and thus parallelism of the system is achieved. A multi-way theta-join has also been processed by using 1-bucket-theta [18]. That method implements a chain-typed theta-join by using a Hilbert curve. As the Hilbert curve requires the data sets to be of the same scale, it cannot be used to join tables of different sizes. A randomized algorithm named Strict-Even-Join (SEJ) was designed to solve multi-way theta-joins in a single MapReduce job [19]. It uses the Lagrangian method to compute the approximate fragments of each relation and minimizes the communication cost between the map and reduce phases, but its partition idea is the same as 1-bucket-theta. To reduce the high I/O cost of intermediate results, a method which uses just two MRJs (MapReduce Jobs) to implement multi-way theta-joins in MapReduce has been proposed: the multi-way theta-join is decomposed into a non-equi-join and a multi-way equi-join, but each theta-join in it is still processed by 1-bucket-theta [15]. The latest work on theta-joins is based on sorting, permutation arrays and bit arrays. It puts the columns to be joined in sorted arrays and uses permutation arrays to encode the positions of tuples in one sorted array w.r.t. the other sorted array [9]. This method needs fundamental changes to parallel frameworks, so it cannot be easily used in real applications.
From the above works, we notice that 1-bucket-theta is the basis for most theta-join and multi-way theta-join processing and is widely used in many algorithms. Although it is useful and efficient, it only considers the load balance among reducers and does not consider how to reduce the computation cost. When the input tables are very large, it achieves low efficiency. In this paper, we propose a method which is also based on 1-bucket-theta, but performs an efficient filter to prune many irrelevant records before the cross product. We propose a key-based filter method which uses only the join attributes, called the max and min values based filter strategy (MMF method). We use this strategy to filter the input data records before the cross product, which is widely used in processing theta-joins. As is well known, the cross product is a time-consuming operation, especially for large data in a parallel framework, and too many intermediate results may cause OOM errors. Deleting useless records which do not contribute to the final results therefore reduces the memory cost and the data copied over the network. We also apply our method to multi-way theta-join and design a key-based join strategy. We implement our methods and the baseline method 1-bucket-theta in a prevalent distributed data processing framework, Spark, and conduct extensive experiments on large scale real and synthetic data sets. The results show that our method outperforms 1-bucket-theta for theta-join queries. The contributions of our work are summarized as follows:
• We propose a max and min values based filter method for computing theta-joins in a distributed framework. It uses the max and min values of the join attributes to filter out useless data which are irrelevant to the final results from the input data sets, and then uses the idea of 1-bucket-theta to evenly partition the cross product results among reducers to achieve load balance in the parallel framework. It effectively reduces the memory cost and network overhead, and therefore improves join efficiency.
• We propose a new join strategy for multi-way theta-join, which performs the cross product only on the join attributes and then uses equi-joins to merge the other output attributes; this greatly reduces the intermediate results of the cross product and the computation cost.
• We compare our methods with the baseline method 1-bucket-theta and an SQL-based query engine in a distributed framework; the results show that our solution is more feasible and effective.

The remainder of the paper is organized as follows. In Section 2, we briefly review the MapReduce computing paradigm and the solution for binary theta-join. Section 3 introduces preliminary concepts and definitions. Section 4 presents the proposed approach. Section 5 analyzes the cost of binary and multi-way theta-join and then proposes optimized versions of these two algorithms. The experimental evaluation is described in Section 6. We discuss related work in Section 7 and conclude in Section 8.

2. Theta-join in MapReduce

In this section, we first present the MapReduce paradigm and how joins are evaluated on it, then briefly review the basic idea of the 1-bucket-theta method and point out its limitations.

2.1. Joins in MapReduce

MapReduce [6] is a popular parallel computation framework, in which data are expressed as (key, value) pairs. It includes two main functions, Map and Reduce. The map function transforms the input pair (k1, v1) to an output pair (k2, v2).
Fig. 1. The idea of 1-bucket-theta.
The output will be partitioned by a hash function to different reducers; each reducer then takes the input (k2, list(v2)), performs user-specified computation to reduce the values of k2, and outputs the final results. Joins in MapReduce include equi-join and non-equi-join (called theta-join). Equi-join is easy to implement because MapReduce is a key–value based programming model whose nature is key-equality, and it can join data sets on the keys with high performance. But for theta-joins, due to these inherent limitations, there exist many problems such as load imbalance, data skew, and memory shortage. 1-bucket-theta is an effective way to solve binary theta-join in the MapReduce framework; it uses a randomized algorithm to partition the cross product results to different reducers, which ensures that each reducer processes nearly the same amount of data. Here we give a brief review of 1-bucket-theta.

2.2. 1-bucket-theta

The 1-bucket-theta method builds a theta-join model between two data sets S and R with a join matrix M, which can represent and implement any theta-join query. The data sets to be joined have |S| and |R| records, and the cross product has |R| × |S| records. There are k reducers in the parallel framework; the method uses a randomized algorithm to partition the cross product into k squares, and each square is distributed to one reducer. The side length l of a square is:

l = √(|R| × |S| / k)    (1)
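For concreteness, the following is a minimal Scala sketch of the randomized region assignment that 1-bucket-theta performs on the R side; it is an illustration under our reading of [11], not the original implementation, and all names are ours. Each R tuple is assigned a random row of the |R| × |S| join matrix and replicated to every square region intersecting that row (S tuples are handled symmetrically on columns), and regions are then mapped onto the k reducers.

import scala.util.Random

// Region ids that one R tuple is replicated to, given table cardinalities and k reducers.
def regionsForRTuple(cardR: Long, cardS: Long, k: Int): Seq[Int] = {
  val l = math.sqrt(cardR.toDouble * cardS / k)      // side length of a square region, Eq. (1)
  val regionRows = math.ceil(cardR / l).toInt        // number of row blocks of the join matrix
  val regionCols = math.ceil(cardS / l).toInt        // number of column blocks of the join matrix
  val row = Random.nextInt(regionRows)               // random row block for this tuple
  (0 until regionCols).map(col => row * regionCols + col) // every region covering that row
}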
Fig. 1 shows the idea of 1-bucket-theta. In Fig. 1, three reducers each receive 12 tuples, and each reducer joins these records and filters them according to the join condition, such as R.B < S.B. 1-bucket-theta requires minimal statistical information, namely the cardinalities of the input tables. The idea of 1-bucket-theta is, from the load balance point of view, to minimize the maximum reducer input. But the tuples processed in each reducer may contain useless pairs that do not contribute to the final results. If we can filter out part of the useless pairs in the input data sets before we perform the algorithm, we can achieve better efficiency.

3. Preliminary

In this section, we formally define the binary theta-join and multi-way theta-join query problems. To improve multi-way theta-join query efficiency, we divide it into two kinds according to the output attributes.
3.1. Theta-join query problem

Definition 1 (Binary Theta-join Query Problem). Assume that there are two relations R(A,B) and S(B,C), and the function θ belongs to {<, ≤, ≥, >}. The query QB defined as R(A,B) ⋈_{R.B θ S.B} S(B,C) is called a binary theta-join query.

Here we focus on binary theta-joins and restrict our discussion of the function θ to {<, ≤, ≥, >}. In this definition, the data sets R and S have the join attribute B, and also have their own attributes A and C. It is a general-purpose definition: the attributes in each data set are not limited to two, and there can also be more than one join attribute. In the MapReduce paradigm, we usually make the join attribute the key and merge all the other attributes as the value, in the style of a pair. A theta-join query cannot be easily implemented in the MapReduce framework. For example, assume the query QB = R(A,B) ⋈_{R.B > S.B} S(B,C); attribute B cannot be used as a key because each tuple (a,b) in R has to be joined not only with the tuple (b,c) in S, but also with all the tuples (bi, ci) in S where bi < b.

Definition 2 (Multi-way Theta-join Query Problem). Assume that there are n data sets (n > 2), R0, R1, ..., Ri, ..., Rn−1, and n − 1 functions θ0, θ1, ..., θn−2 belonging to {<, ≤, ≥, >}. The query QM defined as R0(A,B) ⋈_{R0.B θ0 R1.B} R1(B,C) ⋈_{R1.C θ1 R2.C} R2(C,D) ... Rn−2(X,Y) ⋈_{Rn−2.Y θn−2 Rn−1.Y} Rn−1(Y,Z) is called a multi-way theta-join query.

A multi-way theta-join query often involves more than two data sets and can be viewed as multiple binary theta-joins, so we can decompose the multi-way theta-join into multiple binary theta-joins and join them one by one. In a common MapReduce environment, such as Hadoop, each join needs an individual MapReduce job and has to output the intermediate results to HDFS (Hadoop Distributed File System), which results in high I/O cost. In MapReduce frameworks which support iterative computation, such as Twister [8], Haloop [4] or Spark [17], intermediate results are kept in distributed shared memory for the next step, so the I/O cost is reduced but the memory cost increases. In our work, we implement our algorithms on Spark, so we focus on how to reduce the memory cost and improve computation efficiency.

Definition 3 (Keys Only Multi-way Theta-join Query Problem). QK is a multi-way theta-join query; if, for each theta-join in it, only the join attributes are output, then QK is called a keys only multi-way theta-join query.

A keys only multi-way theta-join focuses only on the join attributes and neglects the other information in the data sets. For example, assume that there are three tables R(A,B), S(B,C) and T(C,D); a keys only multi-way query is as follows:
Q1: Select R.B, S.B, S.C, T.C From R, S, T Where R.B > S.B and S.C < T.C;
In the above query, all the output attributes in the Select clause are join attributes in the Where clause, which become the keys of the key–value pairs in a MapReduce framework, so we call it a keys only multi-way theta-join. For such a query, we can directly use the join attributes to do the cross product, and in a key–value pair of each data set, neither the key nor the value can be divided further. Of course, not all the join attributes must be output, but each output attribute must be a join attribute.

Definition 4 (Not Only Keys Multi-way Theta-join Query Problem).
QNK is a multi-way theta-join query; if, for at least one of the theta-joins in it, not only the join attributes are output, then QNK is called a not only keys multi-way theta-join query.
This kind of query needs more information from the tables, so the output is not limited to the join attributes. We give another example based on the tables in Definition 3.
Q2: Select R.A, R.B, S.B, T.C, T.D From R, S, T Where R.B > S.B and S.C < T.C;
The only difference between queries Q1 and Q2 is the output attributes: Q2 outputs two more attributes, R.A and T.D, which do not appear in the Where clause. This query is closer to a real query in practice. In most cases, we need not only the join attributes but also other information to help us judge. For example, if we want to know the information of a course a student selected, we want not only the course id but also the course name, and sometimes the student's class information as well, which is really needed in practice. For a not only keys multi-way theta-join, both the attributes in the Select clause and those in the Where clause need to be extracted into the form of key–value pairs. The join attribute of one data set that is compared with another data set becomes the key, and the remaining attributes are combined to form the value. After the first theta-join, the intermediate result is a nested pair; we can also extract the join attribute from it as the key and then join with the third data set. For example, in Q2, we first do the cross product on PR(B, A) and PS(B, C) and get PRS((BR, BS), (AR, CS)), then do the second cross product on PRS(CS, (AR, BR, BS)) and PT(C, D). (Here PR(B, A) represents the key–value pair collection transformed from data set R(A,B) which takes the join attribute B as the key; S and T are transformed in a similar way.) The reason why we divide multi-way theta-joins into two kinds is that we will design a keys-based join strategy for query QNK in a later section, and this new strategy greatly improves its join efficiency.

4. Filter strategy and theta-join algorithm

In this section, we first describe the idea of the max and min values based filter strategy, then present our binary theta-join algorithm and multi-way theta-join algorithm on the basis of this filter strategy.

4.1. Max and min values based filter strategy

In this subsection, we give an example of a theta-join on two tables and describe the idea of the max and min values based filter strategy. Assume that R(A,B) and S(B,C) are two tables and the join attribute is B; we consider the following query Q3:
Q3: Select R.B, S.B From R, S Where R.B > S.B;
Fig. 2 shows an example of the join attribute of R and S. In this example, there are 5 records from R and 2 records from S on attribute B, and two output tuples satisfy the join condition. If we directly perform a cross join on the two tables, there are 5 * 2 = 10 tuples to be processed among the reducers. Actually, in table R there are only two records which contribute to the final results (4 and 5), and only 1 record in S (3). If the cross join is done only on these 3 records, the computation cost and communication cost are reduced. Observing the data in R and S, the max and min values are 5 and 1 in R, and 5 and 3 in S. For the join condition R.B > S.B, the data in R which are greater than the min value in S must produce at least 1 final result, so these data should be kept.
Fig. 2. Max and min values based filter strategy.

That is, ∀x ∈ R.B, if x > S.B_min, x should be kept; otherwise it should be filtered. For example, we check all the data in R to find those which are greater than 3 (the min value in S), and we get 4 and 5. Next, we observe the data in S. If the data satisfy the condition R.B > S.B, which can also be written as S.B < R.B, then the data in S which are less than the max value in R must produce at least 1 final result. That is, ∀y ∈ S.B, if y < R.B_max, y should be kept; otherwise it should be filtered. For example, we check all the data in S to find those which are less than 5 (the max value in R), and we get 3. From the above observations, we get two values in R (4, 5) and one value in S (3); we can then do a cross product on these values and output the final results, which are (4, 3) and (5, 3). The computation cost is reduced from 5 * 2 = 10 to 2 * 1 = 2. The observations above apply to the join condition R.B > S.B; when the operator changes, i.e. θ ∈ {<, ≥, ≤}, the filter rule also changes. We give the following theorem to describe the filter rules for the different θ functions.

Theorem 1. Given two tables R(A,B) and S(B,C) with join attribute B:
When θ = ">": ∀x ∈ R.B, if x > S.B_min, x should be kept, otherwise filtered; ∀y ∈ S.B, if y < R.B_max, y should be kept, otherwise filtered.
When θ = "<": ∀x ∈ R.B, if x < S.B_max, x should be kept, otherwise filtered; ∀y ∈ S.B, if y > R.B_min, y should be kept, otherwise filtered.
When θ = "≥": ∀x ∈ R.B, if x ≥ S.B_min, x should be kept, otherwise filtered; ∀y ∈ S.B, if y ≤ R.B_max, y should be kept, otherwise filtered.
When θ = "≤": ∀x ∈ R.B, if x ≤ S.B_max, x should be kept, otherwise filtered; ∀y ∈ S.B, if y ≥ R.B_min, y should be kept, otherwise filtered.

Proof. θ = ">": If there exists x ∈ R.B with x ≤ S.B_min, then ∀y ∈ S.B we have x ≤ y, which contradicts the join condition ">", so x should be filtered. Similarly, if there exists y ∈ S.B with y ≥ R.B_max, then ∀x ∈ R.B we have y ≥ x (that is, x ≤ y), which also contradicts the join condition ">", so y should be filtered. □
The proofs for θ = "<", "≥", "≤" are similar to that for θ = ">", so we omit them here.

The above theorem helps us find, from the input tables, the records which contribute to the final results under the different theta-join conditions. Here we give the algorithm implementation of the max and min values based filter strategy (in Scala style). Algorithm 1 receives two data sets, stored in HDFS, as input, transforms them into key–value pair collections, and then extracts two new pair collections Rkey and Skey from the original pairs; these new collections are the join keys of R and S. After sorting them, we compute the max and min values of Rkey and Skey, which are used in the filter strategy. The filter procedure is from line 7 to line 26: the two data sets are filtered according to Theorem 1 proposed above. This strategy uses only the max and min values of the join attribute from one table to filter the other table, so it does not need to compare the entire two tables to find the records which satisfy the join condition, and therefore improves join efficiency. It guarantees that all the kept data contribute to at least one final result.

Algorithm 1 Max and Min Values based Filter Strategy
Input: data sets R(A,B) and S(B,C), function θ;
Output: filtered pair collections PR(A, B) and PS(B, C);
1: PR(A, B) = transform R(A,B) into key–value pairs;
2: PS(B, C) = transform S(B,C) into key–value pairs;
3: Rkey = PR(A, B).map{x(A, B) => x.B};
4: Skey = PS(B, C).map{y(B, C) => y.B};
5: sort Rkey, Skey in ascending order;
6: compute Rkey_max, Rkey_min, Skey_max, Skey_min;
7: PR(A, B) = PR(A, B).filter{x(A, B) =>
8:   if θ = ">" then
9:     x.B > Skey_min
10:  else if θ = "≥" then
11:    x.B ≥ Skey_min
12:  else if θ = "<" then
13:    x.B < Skey_max
14:  else if θ = "≤" then
15:    x.B ≤ Skey_max
16:  end if}
17: PS(B, C) = PS(B, C).filter{y(B, C) =>
18:   if θ = ">" then
19:     y.B < Rkey_max
20:   else if θ = "≥" then
21:     y.B ≤ Rkey_max
22:   else if θ = "<" then
23:     y.B > Rkey_min
24:   else if θ = "≤" then
25:     y.B ≥ Rkey_min
26:   end if}
27: return PR(A, B), PS(B, C);
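As a complement to the pseudocode, the following is a minimal Spark sketch of the MMF filter for the case θ = ">", assuming R and S are already loaded as RDDs of (B, value) pairs with the join attribute as key; the function and variable names are illustrative assumptions, not the paper's implementation.

import org.apache.spark.rdd.RDD

// MMF filter for theta = ">" (Theorem 1): keep only records that can
// contribute at least one result to R.B > S.B. V and W are the value parts.
def mmfFilterGreater[V, W](pR: RDD[(Int, V)],
                           pS: RDD[(Int, W)]): (RDD[(Int, V)], RDD[(Int, W)]) = {
  val sMin = pS.map(_._1).min()              // S.B_min, the only statistic needed from S
  val rMax = pR.map(_._1).max()              // R.B_max, the only statistic needed from R
  (pR.filter { case (b, _) => b > sMin },    // R side: keep x with x > S.B_min
   pS.filter { case (b, _) => b < rMax })    // S side: keep y with y < R.B_max
}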
Assume that there are m records in R and n records in S; the time complexity of comparing the two tables directly is O(m × n), but Algorithm 1 needs only O(m + n) to find the required records. Its space cost is also O(m + n). This filter strategy (we call it the MMF method) can be directly used for binary theta-join before the cross product; for a multi-way theta-join, the query should be decomposed into multiple binary theta-joins and the strategy applied to them one by one. When the two data sets R and S cover the same value range, we should first split the data set R into two sub-ranges of almost the same size, filter each sub-range and S by the MMF method, then do the join operation separately and merge the join results of the sub-ranges with S, which improves the MMF performance. Next we apply this filter strategy to binary theta-join and multi-way theta-join.

4.2. Binary theta-join algorithm

The way to implement a binary theta-join in a distributed environment should consider not only the overhead of load balancing, but also the memory cost and communication cost, especially in an iterative MapReduce framework, because a huge memory cost will induce OOM errors. Reducing the data before the time-consuming operation is very important for saving computation and data copied over the network. Our solution for binary theta-join is based on 1-bucket-theta, which can achieve optimal load balance in the cluster, but the implementation is different in an iterative MapReduce environment. Fig. 3 illustrates the procedure of our solution for binary theta-join.
Fig. 3. MMF based binary theta-join procedure.

In Fig. 3, the input data sets R(A,B) and S(B,C) are first filtered by the MMF method, and two pair collections PR(A, B) and PS(B, C) are output. We notice that the join attribute B is not in the key position of the pairs PR(A, B), so we map them to the form PR(B, A). Next we do a cross product on these two pair collections and randomly repartition the results to different reducers to balance their workload. The difference between our solution and 1-bucket-theta is that we prune away useless records before the cross product, according to the theta-join condition, by our MMF method; the MMF method needs only the max and min values of the join attribute as statistics, so it saves much computation time. The comparison of results is presented in the experiments section. The algorithm implementation of our binary theta-join is as follows.

Algorithm 2 MMF based Binary Theta-join
Input: data sets R(A,B) and S(B,C), function θ, reducer number k;
Output: theta-join results collection PRS((B, A), (B, C)) which satisfies the join condition θ;
1: PR(A, B) = filter R(A,B) by the MMF method;
2: PS(B, C) = filter S(B,C) by the MMF method;
3: PR(B, A) = PR(A, B).map{x(A, B) => x(B, A)};
4: PRS((B, A), (B, C)) = PR(B, A).cartesian(PS(B, C));
5: PRS((B, A), (B, C)).repartition(k);
6: PRS((B, A), (B, C)) = PRS((B, A), (B, C)).filter{
7:   x((B, A), (B, C)) =>
8:   if θ = ">" then
9:     x.key.B > x.value.B
10:  else if θ = "≥" then
11:    x.key.B ≥ x.value.B
12:  else if θ = "<" then
13:    x.key.B < x.value.B
14:  else if θ = "≤" then
15:    x.key.B ≤ x.value.B
16:  end if}
17: return PRS((B, A), (B, C));

In Algorithm 2, after being filtered by MMF, the two pair collections PR(B, A) and PS(B, C) undergo a cross product via the "cartesian" operation, and a new nested pair collection PRS((BR, A), (BS, C)) is produced. The cross product results are evenly distributed to k different reducers by the "repartition" operation. The form PRS((BR, A), (BS, C)) consists of two parts, the key (BR, A) and the value (BS, C), which are themselves pairs, and the join attribute B is at the key position of each part. Although we have reduced the input data amount, there still exist records which do not satisfy the join condition, so we continue with a filter on PRS((BR, A), (BS, C)) according to the function θ to pick out the required results. The returned results contain all the attributes of the two data sets, and we can map them to any attributes required by the query; for example, if we just need R(B,A) and S(B), we can map PRS((BR, A), (BS, C)) to the form PRS((BR, A), BS). As there is a cross product in this algorithm, its time complexity is O(m′ × n′), where m′ is the number of records in the filtered R and n′ the number in the filtered S.

4.3. Multi-way theta-join algorithm

In an iterative MapReduce environment, as the results of each step can be kept in memory, we can use them as input for the next step's computation. This makes it possible to treat a multi-way theta-join query as multiple binary theta-joins. The direct way to solve it is to join in sequence. Assume that there are three data sets R(A,B), S(B,C) and T(C,D); a multi-way theta-join looks like R(A,B) ⋈_{R.B θ0 S.B} S(B,C) ⋈_{S.C θ1 T.C} T(C,D). We can join R and S by using our MMF based binary theta-join, and the join result of R and S is then joined with T in the same way. Algorithm 3 gives an implementation of this example.

Algorithm 3 MMF based Multi-way Theta-join
Input: data sets R(A,B), S(B,C) and T(C,D), functions θ0, θ1, reducer number k;
Output: theta-join results collection PRST(((B, A), (B, C)), (C, D)) which satisfies the join conditions θ0, θ1;
1: PR(A, B) = filter R(A,B) by the MMF method;
2: PS(B, C) = filter S(B,C) by the MMF method;
3: PRS((B, A), (B, C)) = compute binary theta-join according to Algorithm 2 (θ0);
4: PRS((B, A), (B, C)) = filter PRS((B, A), (B, C)) by the MMF method;
5: PT(C, D) = filter T(C,D) by the MMF method;
6: PRST(((B, A), (B, C)), (C, D)) = compute binary theta-join according to Algorithm 2 (θ1);
7: return PRST(((B, A), (B, C)), (C, D));
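To make the chaining concrete, the following hedged Spark sketch instantiates Algorithms 2 and 3 for the case θ0 = θ1 = ">", reusing the mmfFilterGreater helper sketched after Algorithm 1; the pair types (integer keys, simple values) and all names are illustrative assumptions, not the paper's actual code.

import org.apache.spark.rdd.RDD

// Algorithm 2 sketch: cross product of the MMF-filtered collections, even
// repartitioning over k reducers, then the exact check of R.B > S.B.
def mmfBinaryJoinRS(pR: RDD[(Int, String)], pS: RDD[(Int, Int)],
                    k: Int): RDD[((Int, String), (Int, Int))] = {
  val (fR, fS) = mmfFilterGreater(pR, pS)
  fR.cartesian(fS).repartition(k)
    .filter { case ((bR, _), (bS, _)) => bR > bS }
}

// Algorithm 3 sketch for R(A,B), S(B,C), T(C,D): the intermediate result is
// re-keyed on S.C, filtered again by MMF together with T, and joined once more.
def mmfMultiWayJoin(pR: RDD[(Int, String)],   // key = R.B, value = R.A
                    pS: RDD[(Int, Int)],      // key = S.B, value = S.C
                    pT: RDD[(Int, String)],   // key = T.C, value = T.D
                    k: Int): RDD[((Int, (Int, String, Int)), (Int, String))] = {
  val rs = mmfBinaryJoinRS(pR, pS, k)                                // ((R.B, R.A), (S.B, S.C))
  val rsByC = rs.map { case ((bR, a), (bS, c)) => (c, (bR, a, bS)) } // re-key on S.C
  val (fRS, fT) = mmfFilterGreater(rsByC, pT)                        // MMF on intermediate result and T
  fRS.cartesian(fT).repartition(k)
    .filter { case ((cS, _), (cT, _)) => cS > cT }                   // keep pairs with S.C > T.C
}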
In Algorithm 3, we first decompose the multi-way theta-join into two binary theta-joins and join them in sequence. The intermediate result of the first theta-join is a nested pair collection PRS((BR, A), (BS, C)); when we use the MMF method, we should first map it to the form PRS(CS) and use it to filter the data set T(C,D). When filtering PRS((BR, A), (BS, C)), we should likewise get the form PT(C) from PT(C, D) and use it to compare with the C attribute in PRS((BR, A), (BS, C)). We omit the details here and just give the idea of the algorithm implementation. Assume that there are m records in R, n records in S, and u records in T, and that the filtered records number m′, n′ and u′, respectively (m′ < m, n′ < n, u′ < u); then the time complexity of our multi-way theta-join is less than O(m′ × n′ × u′), because the intermediate results of the first theta-join are also filtered by the MMF method. Algorithm 3 shows the theta-join procedure for three tables; when there are more than 3 input tables, we can continue the above procedure to get the final results. Although we have reduced the input data amount for multi-way theta-join, the computation cost of the cross product for more than two tables is still very large, which may cause OOM errors. We notice that in Algorithm 3 all the attributes of the three tables take part in the computation of the cross product, which consumes much memory. If we can reduce the data amount for the cross product of a multi-way theta-join, query efficiency will be improved. In Section 3, we divided multi-way theta-joins into two kinds according to the output attributes. For a keys only multi-way theta-join, we can use just the keys to do the cross product and output them directly. For a not only keys multi-way theta-join, we can divide the attributes in each data set, pick out the keys to do the cross product, filter them according to the join condition, and then merge the filtered records with the other output attributes. In the next section, we introduce some optimization techniques for our algorithms.

5. Optimization

In this section, we present the optimization techniques we use to improve our binary and multi-way theta-join algorithms. First we analyze the cost of these two algorithms and point out their bottleneck, then we propose two optimized algorithms which aim at reducing the computation cost and memory cost.
5.1. Cost analysis

In 1-bucket-theta, the optimization goal is to minimize the job completion time for a given number of processing nodes, which is generalized to a load balance problem. As it is implemented in a common MapReduce environment, the Map and Reduce functions affect the cost of data input and output to the DFS (Distributed File System). It uses a join matrix as its theta-join model and tries to map the join matrix to reducers so as to minimize the job completion time. In an iterative MapReduce environment, our goal is also to minimize the job completion time, but we need not consider problems such as how many jobs to use or how to read data from the previous job, which are handled by the framework; we focus on how to reduce the computation cost and the intermediate results shuffled over the network, so as to enhance job execution efficiency. From Algorithm 2, we can see that the cost of a binary theta-join consists of four parts: the MMF time to filter useless pairs, the cross product time to produce the join matrix, the repartition time to reshuffle the cross product results to different reducers, and the filter time to pick out the records which satisfy the join condition. The cost model can be written as:

T_binary = T_MMF + T_cartesian + T_repartition + T_filter    (2)

T_cartesian = T_(m×n)    (3)
In Eq. (2), T_MMF, T_repartition and T_filter cannot be reduced because each operation already produces the least output: T_MMF ensures the smallest input data sets, T_repartition evenly distributes each reducer's input to ensure load balance, and T_filter does an exact check to ensure that each record in the final results satisfies the join condition. So these three parts cannot be optimized further. The cross product time is represented by Eq. (3), which produces the whole join matrix (m records from R and n records from S). Although its input comes from the MMF output, the cross product results still consume large amounts of memory. So our optimization goal still focuses on how to reduce the data amount for the cross product. A multi-way theta-join contains multiple binary theta-joins, so its cost model can be formulated as:

T_multi-way = Σ_{i=1}^{k} T_binary^(i),  k ≥ 2    (4)
In Eq. (4), the cost model of a multi-way join consists of k binary theta-joins. Although we have optimized the join efficiency of the binary theta-join, when the data sets become larger, doing a cross product over multiple tables still produces a huge memory cost. Actually, we need not let all the attributes take part in this operation; the join attribute is enough. Assume that in a binary theta-join ten attributes should be output, five from each data set, and the data size of each attribute is 1M; then the cross product of these two tables will consume 5M × 5M = 25G memory. If we only use the keys to do it, the memory consumption is reduced to 1M × 1M = 1G, which greatly reduces the memory cost and therefore enhances job execution efficiency. In the next subsection, we introduce the optimized versions of the two algorithms.

5.2. Optimized binary theta-join algorithm

We divided multi-way theta-joins into two kinds in Section 3 according to the output attributes; the same idea can be applied to binary theta-join. We did not emphasize this for binary theta-join because the memory consumption of its cross product is much less than that of a multi-way theta-join.
Fig. 4. DM method for binary theta-join.
But when there are more than two data sets, the memory cost cannot be ignored because OOM errors may occur. Consider a binary theta-join of tables R(A,B) and S(B,C) with the following query:
Q4: Select R.A, R.B, S.B, S.C From R, S Where R.B > S.B;
For the above query, the output attributes include not only the join keys (R.B, S.B), but also other attributes (R.A and S.C). If we use Algorithm 2, the cross product will involve four attributes: all the output attributes take part in the computation of the cross product and take up memory. Actually, the role of the cross product is to produce the whole join matrix, from which records are picked according to the join condition. If only the join attributes (R.B and S.B) take part in the computation, the data amount for the cross product becomes smaller. After picking out the join attributes according to the join condition, we get a subset of the cross product. To be specific, if the collection of the join matrix is M(R.B, S.B) and the collection of records satisfying the join condition is V(R.B, S.B), then V ⊆ M. To get the other output attributes (R.A and S.C), we can perform two equi-joins. First, join V(R.B, S.B) with R(B,A) to get attribute R.A. Second, assuming the result of the first step is P(R.B, (S.B, R.A)), we transform it into the form P(S.B, (R.B, R.A)) and join it with S(B,C); we then get the pair collection P(S.B, ((R.B, R.A), S.C)), which includes all the output attributes. We name this method the Divide and Merge (DM) method. Fig. 4 illustrates the idea of the DM method for query Q4. In Fig. 4, the data sets R and S become pair collections after MMF filtering; we then map them into keys only form (PR(B), PS(B)). We do a cross product on PR(B) and PS(B) and filter the result by the join condition R.B > S.B. Next we do two equi-joins to get the output pair collection P(S.B, ((R.B, R.A), S.C)). As an equi-join in a MapReduce environment requires the same keys from the different data sets, we have to transform the pair collections so that they have the same keys. The advantage of the DM method is that we use the fewest attributes to do the cross product and filter them by the join condition, thus obtaining a subset of the join matrix which contains only the records for output. Then we use equi-joins on this subset and the pair collections filtered by MMF to get all the output attributes. The input records for both the cross product and the equi-joins are reduced, so the memory cost is minimized and theta-join efficiency is improved. The improved binary theta-join algorithm is shown as Algorithm 4. The difference between Algorithm 4 and Algorithm 2 lies in lines 3 to 23: Algorithm 2 computes one cross product on all the output attributes, while Algorithm 4 uses one cross product on the keys and two equi-joins on the filtered records.
Algorithm 4 DM based Binary Theta-join
Input: data sets R(A,B) and S(B,C), function θ, reducer number k;
Output: theta-join results collection P(BS, ((BR, A), C)) which satisfies the join condition θ;
1: PR(A, B) = filter R(A,B) by the MMF method;
2: PS(B, C) = filter S(B,C) by the MMF method;
3: PR(B) = PR(A, B).map{x(A, B) => x(B)};
4: PS(B) = PS(B, C).map{y(B, C) => y(B)};
5: P(BR, BS) = PR(B).cartesian(PS(B));
6: P(BR, BS).repartition(k);
7: P(BR, BS) = P(BR, BS).filter{
8:   x(BR, BS) =>
9:   if θ = ">" then
10:    x.BR > x.BS
11:  else if θ = "≥" then
12:    x.BR ≥ x.BS
13:  else if θ = "<" then
14:    x.BR < x.BS
15:  else if θ = "≤" then
16:    x.BR ≤ x.BS
17:  end if}
18: PR(B, A) = PR(A, B).map{x(A, B) => x(B, A)}
19: P(BR, (BS, A)) = P(BR, BS).join(PR(B, A))
20: P(BS, (BR, A)) = P(BR, (BS, A)).map{
21:   x(BR, (BS, A)) => x(BS, (BR, A))}
22: P(BS, ((BR, A), C)) = P(BS, (BR, A)).join(PS(B, C))
23: return P(BS, ((BR, A), C));
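For illustration, the following hedged Spark sketch implements the DM idea of Algorithm 4 for θ = ">" on query Q4: a keys-only cross product, the theta check, and then two equi-joins to re-attach R.A and S.C. The types and names are simplified assumptions, not the paper's code.

import org.apache.spark.rdd.RDD

// pR: (R.B, R.A) pairs, pS: (S.B, S.C) pairs, both already filtered by MMF.
def dmBinaryThetaJoin(pR: RDD[(Int, String)], pS: RDD[(Int, String)],
                      k: Int): RDD[(Int, ((Int, String), String))] = {
  val keysR = pR.map(_._1)                            // keys only: R.B
  val keysS = pS.map(_._1)                            // keys only: S.B
  val v = keysR.cartesian(keysS)                      // join matrix on keys alone
    .repartition(k)
    .filter { case (bR, bS) => bR > bS }              // V(R.B, S.B) ⊆ M(R.B, S.B)
  val withA = v.join(pR)                              // (R.B, (S.B, R.A)): first equi-join
    .map { case (bR, (bS, a)) => (bS, (bR, a)) }      // re-key on S.B
  withA.join(pS)                                      // (S.B, ((R.B, R.A), S.C)): second equi-join
}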
Fig. 5. DM method for multi-way theta-join.
Although we add two equi-join operations compared with Algorithm 2, we avoid the huge memory cost of a cross product on all the output attributes, which may induce OOM errors. For the equi-joins, we just use a subset of the cross product and the records filtered by MMF, so the input data amount is reduced to a minimum, and the computation cost and memory cost of Algorithm 4 are reduced.

5.3. Optimized multi-way theta-join algorithm

As a multi-way theta-join can be divided into multiple binary theta-joins, the DM method proposed in the last subsection can also be used for it. A keys only multi-way theta-join just outputs join keys, so it need not join other attributes for output; it just needs one cross product to construct key pairs, which are then filtered by the join condition. This kind of query cannot be optimized further because the input for the cross product is already minimal. Actually, this kind of query is very rare in practice; most queries are not only keys queries, so we focus on optimizing that kind of query. In order to use the DM method in a not only keys multi-way theta-join, we also decompose it into several binary theta-joins and apply the DM method to them one by one. The procedure is shown in Fig. 5, which illustrates the multi-way theta-join procedure for query Q2 in Section 3; we can see that two DM binary theta-joins are used to compute the theta-join of the three tables (R, S and T). In the first DM binary theta-join, the output pair collection is P(S.B, ((R.B, R.A), S.C)); we can map it to the form P(S.C), and map PT(C, D) to the form P(T.C), which includes only the keys; then we do a cross product on these two pair collections and filter them by the join condition (S.C < T.C). The filtered results then undergo two equi-joins with P(S.B, ((R.B, R.A), S.C)) and PT(C, D) to get the final results. The details of the procedure are given in the following algorithm.

Algorithm 5 DM based Multi-way Theta-join
Input: data sets R(A,B), S(B,C) and T(C,D), functions θ0, θ1, reducer number k;
Output: theta-join results collection P(T.C, (((R.B, R.A), S.B), T.D)) which satisfies the join conditions θ0, θ1;
1: PR(A, B) = filter R(A,B) by the MMF method;
2: PS(B, C) = filter S(B,C) by the MMF method;
3: PT(C, D) = filter T(C,D) by the MMF method;
4: P(S.B, ((R.B, R.A), S.C)) = compute DM binary theta-join according to Algorithm 4 (θ0);
5: P(T.C, (((R.B, R.A), S.B), T.D)) = compute DM binary theta-join according to Algorithm 4 (θ1);
6: return P(T.C, (((R.B, R.A), S.B), T.D));

In Algorithm 5, as the output attributes do not contain S.C, we remove it from the output collection by a mapping operation. The intermediate result is P(S.B, ((R.B, R.A), S.C)); by mapping it to P(S.C, ((R.B, R.A), S.B)), it can be joined with P(S.C, T.C), which is the second cross product result, giving P(S.C, (((R.B, R.A), S.B), T.C)). We continue to map it to the form P(T.C, (((R.B, R.A), S.B), S.C)); as S.C does not appear in the output attributes, we remove it, obtain P(T.C, ((R.B, R.A), S.B)), and join with P(T.C, T.D), which gives the final pair collection P(T.C, (((R.B, R.A), S.B), T.D)). The difference between Algorithm 5 and Algorithm 3 is that all the cross product operations involve just the join key attributes, which uses the least memory for the theta-join computation. That is, we divide the attributes into keys and non-keys, and after filtering out many useless records, we merge the keys with the non-key attributes; that is why the method is called the Divide and Merge method.

6. Experiment evaluation

We implemented various versions of the algorithms for binary theta-join and multi-way theta-join. For binary theta-join, the algorithms include the baseline methods 1-bucket-theta and Spark-SQL, the MMF method we proposed for binary theta-join, and the optimized binary theta-join (DM method). For multi-way theta-join, the algorithms include the baseline method Spark-SQL, the MMF based method and the DM based method. We also scale our tests to different data sizes to test the scalability of our algorithms. We focus on the performance of the different algorithms.

6.1. Experiment setup
Environment. All experiments are evaluated on a 16-node cluster. Each node has two 6-core Intel Xeon X5650 processors with 48 GB of RAM and a 1 TB hard disk. We install Red Hat Linux 4.7.0-4 and Java 1.6 with a 64-bit server JVM on each node. The installed Spark version is 1.2.0, compiled for Hadoop 1, running in cluster mode with one master node and 15 slave nodes. The Hadoop version is 1.2.1. We run each algorithm 10 times and report the average execution time.
Data sets. We use both a synthetic data set and a real data set to test our approach. The synthetic data set is the TPC-H benchmark data set [14], which includes 8 tables and 22 queries. We use the open source tool DBGEN to generate different data sizes, and design queries for binary theta-join and multi-way theta-join on the basis of the TPC-H tables.
Table 1
Output size for different n (θ = "<").
Size of n:     5     10    15    20
Output size:   67    270   926   1756

Table 2
Output size for different n (θ = ">").
Size of n:     50      100     150      200
Output size:   11 161  46 541  106 554  190 598
Fig. 6. Performance test for binary theta-join (θ = ‘‘<’’).
The real data set we employed for the experiments is a data set (Cloud) [9] containing extended cloud reports from ships and land stations (ftp://cdiac.ornl.gov/pub3/ndp026c/); the data size is 9 GB.
Fig. 7. Performance test for binary theta-join (θ = ">").

Table 3
Memory test result for different algorithms.
                    1-bucket  Spark-SQL  MMF   DM
Memory usage (GB):  27.3      28.5       26.2  25.9
6.2. Binary theta-join evaluation

The major factors which affect a binary theta-join include the following: (1) input data size; (2) the number of partitions over which we distribute data to different reducers; (3) output size. For the first factor, we use DBGEN to generate data from 1 GB to 100 GB and test the performance of each algorithm. For the second factor, we adjust the partition number from 16 to 80 and observe the performance difference. For the third factor, we adjust the selection condition to control the output data size and record the performance variation. The query we designed for binary theta-join is as follows:
Qbinary: Select c_custkey, c_name, o_custkey, o_orderdate From Customer, Orders Where c_custkey θ o_custkey and o_custkey < n;
We select Customer and Orders from the TPC-H tables to design the query for binary theta-join; the query is of the not only keys type with join keys c_custkey and o_custkey. The other attributes are c_name from Customer and o_orderdate from Orders. In the WHERE clause, the theta-join condition is represented by "c_custkey θ o_custkey", where θ can be any operator in {>, <, ≥, ≤}. The selection condition "o_custkey < n" is used to control the output size. First, we set θ to "<" and n to 5, 10, 15, 20, with a data size of 1 GB. The distribution of the join keys is as follows:
c_custkey in Customer: 1–150 000, consecutive
o_custkey in Orders: 1–150 000, random
The output sizes for different n are given in Table 1. We test the performance of our proposed methods MMF and DM, and the standard methods 1-bucket-theta and Spark-SQL, on the same queries. The partition number is 48. Fig. 6 illustrates the results; the time unit is seconds.
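As a point of reference, a hedged sketch of how Qbinary can be issued against the Spark-SQL baseline is shown below, using the Spark 1.2-era SQLContext and temp-table API; the loading of the TPC-H tables and the concrete values of θ and n are assumptions for illustration, not details taken from the paper.

import org.apache.spark.sql.{SQLContext, SchemaRDD}

// customer and orders are assumed to be SchemaRDDs already loaded from the
// generated TPC-H data; n controls the output size as in the experiments.
def runQBinary(sqlContext: SQLContext, customer: SchemaRDD,
               orders: SchemaRDD, n: Int): SchemaRDD = {
  customer.registerTempTable("Customer")
  orders.registerTempTable("Orders")
  sqlContext.sql(
    s"SELECT c_custkey, c_name, o_custkey, o_orderdate " +
    s"FROM Customer, Orders " +
    s"WHERE c_custkey < o_custkey AND o_custkey < $n")
}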
In Fig. 6, although we tested the four methods, Spark-SQL cannot work out the results: it also uses a cross product to deal with the theta-join, but it uses only several nodes of the cluster for the cross product, which places a heavy workload on those machines, and too many intermediate results in memory also cause OOM errors. We can observe this from the Spark execution logs. 1-bucket-theta distributes the cross product results to different reducers to balance the load in the cluster, so it can get the results; but as it does not filter useless records before the computation, its execution efficiency is not good, and it achieves the worst performance among the three methods that complete. The DM method has the best performance of all the methods. Compared with 1-bucket-theta, it is 803/13 = 61.8 times faster. The reason is that DM uses two optimization techniques: one is the MMF method to filter useless records before the computation, and the other is that it uses only the keys to do the cross product, which reduces the memory cost. The MMF method also greatly improves on the performance of 1-bucket-theta; from the test results, we can see that it is 803/18 = 44.6 times faster than 1-bucket-theta. Second, we set θ to ">" and n to 50, 100, 150, 200, with a data size of 1 GB. The output sizes for different n are given in Table 2. We again test the performance of 1-bucket-theta, MMF, DM and Spark-SQL on the same queries, with the partition number set to 48. Fig. 7 illustrates the results. From Fig. 7, we can see that for θ = ">", DM again achieves the best performance among the four methods and Spark-SQL the worst. With increasing n, the performance of each method decreases, but Spark-SQL degrades much more than the others. The reason is as analyzed before: although MMF and DM are based on 1-bucket-theta, they filter out many records, according to the method proposed in Section 4, before the computation starts; whether for ">" or "<" theta-joins, they greatly improve the algorithm's efficiency. To illustrate the memory cost of the different algorithms, we also recorded the memory usage in the above test; Table 3 shows the result. The records in Table 3 are the memory usage of the 1-bucket, Spark-SQL, MMF and DM methods.
Fig. 8. Relationship between partitions number and performance among different algorithms.
The DM method uses the least memory among the four algorithms. In this test, the memory usage of each algorithm is the average over the different values of the control variable n, which is set to 50, 100, 150 and 200. Memory usage varies with the amount of the theta-join's intermediate results: more intermediate results mean more memory usage. This test shows that the MMF and DM methods reduce the memory cost of computing theta-joins. Third, to illustrate the relationship between the partition number and performance, we increase the partition number from 16 to 80; as the cluster has 16 nodes, the increment is also set to 16. We set θ = ">" and n = 100. Fig. 8 illustrates the results. From Fig. 8, we can see that as the partition number increases, each algorithm's performance improves accordingly, but the improvement peaks when the partition number reaches 48; when we continue to add partitions, performance begins to degrade. The trend for Spark-SQL is the same as for the other three algorithms, but as its running time is much longer than the others and is not suitable for observation in the graph, we omit it here. As 48 is the optimal partition number, we use it in every theta-join test. The reason is that when the partition number increases, the parallelism of the framework increases too, so less processing time is needed; but the partition number cannot be increased without limit, because with more partitions, more data need to be merged, and the merging time also increases. Parallelism and merging time should be balanced at a suitable partition number, which is why the value 48 achieves the best performance in this test. Fourth, we enlarge the data size from 1 GB to 100 GB to test the scalability of our algorithms, with θ set to ">" and n set to 5. Fig. 9 illustrates the results; the time unit is seconds. From Fig. 9, we can see that as the data size increases, the running time of each algorithm increases accordingly, but the DM method again gives the best performance. Spark-SQL has the worst performance, and 1-bucket-theta is almost the same. The MMF method is better than Spark-SQL and 1-bucket-theta but worse than the DM method. The reason is as analyzed before: DM uses not only the MMF filter technique but also the fewest attributes for the cross product, so the intermediate results are reduced to a minimum and it has the best performance; its running time increases almost linearly with the growth of the data size. Finally, we evaluate our method DM and the baseline methods 1-bucket-theta and Spark-SQL on the real data set, Cloud, to observe the performance difference.
Fig. 9. Scalability test among different algorithms.
Table 4
Performance test result for real data.
Data    1-bucket(s)  Spark-SQL(s)  DM(s)  Improve ratio
Cloud   234          1282          67     3.5, 19.1
The data distribution of the join keys is as follows:
latitude: −9000 to 9000
longitude: 0 to 36 000
The SQL we designed is as follows:
Select S.date, S.longitude, S.latitude, T.latitude From Cloud AS S, Cloud AS T Where S.date = T.date AND S.longitude = T.longitude AND S.latitude − T.latitude < 5;
Table 4 shows the running times of the different algorithms when executing the above SQL query; the time unit is seconds. The output size is 170914849. In Table 4, columns 2 to 4 give the running times of 1-bucket, Spark-SQL and DM. The Improve ratio column includes two values: the first is the speed-up ratio of DM over 1-bucket-theta, and the second is that of DM over Spark-SQL. The real data test result is in accordance with that of the synthetic data set: the DM method accelerates the query efficiently compared with the other two methods.

6.3. Multi-way theta-join evaluation

For the multi-way theta-join test, we select 6 multi-way queries from the TPC-H benchmark: Q3, Q5, Q10, Q11, Q20 and Q21. As the join conditions of these queries are equi-joins, we change one of the equality conditions in each query to a theta-join with θ set to ">", and add the control condition "n < 1000" to produce fewer output results; the data size is 1 GB. As the DM method also uses the MMF technique, the test result for the MMF method is not provided here. Table 5 shows the running times of the different algorithms; the time unit is seconds. In Table 5, columns 2 to 4 give the running times of 1-bucket, Spark-SQL and DM, respectively. Column 5 is the improve ratio, which again includes two values: the first is the speed-up ratio of DM over 1-bucket-theta, and the second is that of DM over Spark-SQL. From the test results, we can see that for multi-way theta-join queries DM is also faster than the other two methods; the highest improve ratio reaches 77.4 times.
Table 5
Multi-way theta-join test result for TPC-H queries.
Query  1-bucket(s)  Spark-SQL(s)  DM(s)  Improve ratio
Q3     62           448           21     2.9, 21.3
Q5     75           2 168         28     2.7, 77.4
Q10    245          1 680         87     2.8, 19.3
Q11    23           26            11     2.1, 2.3
Q20    29           31            18     1.6, 1.7
Q21    651          12 225        376    1.7, 32.5
As a multi-way theta-join often involves more than two tables, the intermediate results are much larger than for a binary theta-join; for example, one cross join on 1G × 1G data produces 1T of data, and two cross joins on 1G × 1G × 1G produce 1P of intermediate results. So in multi-way theta-joins over very large data sets OOM errors will occur, which is why we add the control condition to reduce this kind of error. But the main reason for the improve ratio of DM is that when we do the cross join, we use just the useful records and columns which contribute to the final results and filter out many useless records to reduce the computation cost. Whether in binary theta-join or multi-way theta-join, this kind of method greatly improves join efficiency.

7. Related work

Processing theta-joins over large scale data has always been a challenge in the database area. Some early works, such as [5,10,13], proposed evaluation strategies for complex join queries, but their methods cannot scale to process theta-joins over very large data sets. In distributed environments, most works use the MapReduce framework to process joins, but they focus only on equi-joins [3,1]. Some methods can support theta-joins in the MapReduce model, but they need fundamental changes to the framework itself [16,9], which forces users to implement non-trivial functions that manipulate the dataflow in the distributed system. To reduce the network overhead and improve join efficiency, Bloom filters have been used to process two-way and multi-way joins in distributed systems [20]. Similar work can be found in [12]; the main idea of these works is to reduce the communication cost of distributed join computation. But due to the nature of the Bloom filter, which checks membership in a set, this kind of method can only be applied to equi-joins and is not applicable to theta-joins. Data skew is known to cause poor query cost estimates in distributed large scale data analysis, and some works have tried to address this problem in joins: DeWitt proposed four equi-join algorithms and showed that the traditional hybrid hash join is the winner in low-skew or no-skew cases [7]. Pig supports these skewed equi-join implementations on top of MapReduce [2]. Okcan proposed a method which uses a randomized algorithm to distribute the cross product results across the network, thereby reducing the impact of data skew in a binary theta-join [11]. Our method considers not only the input data amount for the cross product, but also the load balance in the cluster, which avoids data skew in both binary theta-join and multi-way theta-join.

8. Conclusion

In this paper, we proposed two methods to effectively process binary theta-join and multi-way theta-join in a distributed environment. The first method is a filter strategy called MMF, which filters many irrelevant records from the input data sets before computing the cross product of the theta-join, thereby reducing the computation cost. The second method is the DM method, which can be used in both binary theta-join and multi-way theta-join. This method uses just the join attributes to do the cross product and uses equi-joins to combine the
result with the other output attributes. As only part of the attributes take part in the costly operation, and the input data are also filtered by MMF, theta-join efficiency can be greatly improved. We tested our methods through extensive experiments over both synthetic and real-world data in a distributed cluster; the experiments show that our solution speeds up query processing compared with the state-of-the-art solutions, with a speed-up ratio ranging from 1.6 to 77.4 depending on the theta-join conditions. In the future, we will continue to study new optimizations to improve the performance of our algorithms.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (No. 61303037, 61732014), the Shaanxi Natural Science Foundation of China (No. 2017JM6104), the National Basic Research Program (973 Program) of China (No. 2012CB316203), and the National High Technology Research and Development Program (863 Program) of China (No. 2012AA011004).

References

[1] F.N. Afrati, J.D. Ullman, Optimizing joins in a map-reduce environment, in: Proceedings of the 13th International Conference on Extending Database Technology, ACM, 2010, pp. 99–110.
[2] Apache Pig. http://pig.apache.org/.
[3] S. Blanas, J.M. Patel, V. Ercegovac, J. Rao, E.J. Shekita, Y. Tian, A comparison of join algorithms for log processing in MapReduce, in: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, ACM, 2010, pp. 975–986.
[4] Y. Bu, B. Howe, M. Balazinska, M.D. Ernst, HaLoop: efficient iterative data processing on large clusters, Proc. VLDB Endowment 3 (1–2) (2010) 285–296.
[5] S. Chaudhuri, M.Y. Vardi, Optimization of real conjunctive queries, in: Proceedings of the Twelfth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, ACM, 1993, pp. 59–70.
[6] J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, Commun. ACM 51 (1) (2008) 107–113.
[7] D.J. DeWitt, J.F. Naughton, D.A. Schneider, S. Seshadri, Practical Skew Handling in Parallel Joins, University of Wisconsin-Madison, Computer Sciences Department, 1992.
[8] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, G. Fox, Twister: a runtime for iterative MapReduce, in: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, ACM, 2010, pp. 810–818.
[9] Z. Khayyat, W. Lucia, M. Singh, M. Ouzzani, P. Papotti, J.-A. Quiané-Ruiz, N. Tang, P. Kalnis, Lightning fast and space efficient inequality joins, Proc. VLDB Endowment 8 (13) (2015) 2074–2085.
[10] C. Lee, C.-S. Shih, Y.-H. Chen, Optimizing large join queries using a graph-based approach, IEEE Trans. Knowl. Data Eng. 13 (2) (2001) 298–315.
[11] A. Okcan, M. Riedewald, Processing theta-joins using MapReduce, in: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, ACM, 2011, pp. 949–960.
[12] S. Ramesh, O. Papapetrou, W. Siberski, Optimizing distributed joins with Bloom filters, in: Distributed Computing and Internet Technology, Springer, 2008, pp. 145–156.
[13] K.-L. Tan, H. Lu, A note on the strategy space of multiway join query optimization problem in parallel systems, ACM SIGMOD Record 20 (4) (1991) 81–82.
[14] Transaction Processing Performance Council. http://www.tpc.org.
[15] K. Yan, H. Zhu, Two MRJs for multi-way theta-join in MapReduce, in: Internet and Distributed Computing Systems, Springer, 2013, pp. 321–332.
[16] H.-c. Yang, A. Dasdan, R.-L. Hsiao, D.S. Parker, Map-reduce-merge: simplified relational data processing on large clusters, in: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, ACM, 2007, pp. 1029–1040.
[17] M. Zaharia, M. Chowdhury, M.J. Franklin, S. Shenker, I. Stoica, Spark: cluster computing with working sets, in: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, Vol. 10, 2010, p. 10.
[18] X. Zhang, L. Chen, M. Wang, Efficient multi-way theta-join processing using MapReduce, Proc. VLDB Endowment 5 (11) (2012) 1184–1195.
[19] C. Zhang, J. Li, L. Wu, M. Lin, W. Liu, SEJ: an even approach to multiway theta-joins using MapReduce, in: Cloud and Green Computing (CGC), 2012 Second International Conference on, IEEE, 2012, pp. 73–80.
[20] C. Zhang, L. Wu, J. Li, Efficient processing distributed joins with Bloom filter using MapReduce, Int. J. Grid Distrib. Comput. 6 (3) (2013) 43–58.
Wenjie Liu obtained her Master Degree in 2003 and her Doctor Degree in computer science from Northwestern Polytechnical University, Xi'an, Shaanxi, China, in December 2009. Since 2003, she has been a teacher at this university, working in the Department of Computer Software and Theories. In 2014, she was a visiting researcher at the database lab, Department of Computer Science and Engineering, Hong Kong University of Science and Technology, where she worked on cloud computing and big data processing. Her research interests include cloud computing, distributed databases, and massive data management.
Zhanhuai Li is a professor in the Department of Computer Software and Theories, School of Computer, Northwestern Polytechnical University, Xi'an, Shaanxi, China. He is a doctoral supervisor, a CCF fellow and a Database Committee fellow of China. His research interests include stream data management, data mining, massive data management, and cloud data storage.