Information Sciences 507 (2020) 365–385
Learned sketches for frequency estimation
Meifan Zhang, Hongzhi Wang∗, Jianzhong Li, Hong Gao
Department of Computer Science and Technology, Harbin Institute of Technology, China
Article history: Received 28 January 2019; Revised 15 August 2019; Accepted 18 August 2019; Available online 22 August 2019
Keywords: Sketches, Frequency estimation, Query processing
Abstract
The Count-Min sketch and its variations are widely used to solve the frequency estimation problem due to their sub-linear space cost. However, collisions between high-frequency and low-frequency items introduce a significant estimation error. In this paper, we propose two learned sketches, the Learned Count-Min sketch and the Learned Augmented sketch, which combine machine learning methods with the traditional Count-Min sketch and Augmented sketch to improve their performance. We use a regression model trained on historical data to predict item frequencies and to separate the high-frequency items from the low-frequency items. The experimental results indicate that our learned sketches outperform the traditional Count-Min sketch and Augmented sketch: they provide more accurate estimations with a more compact synopsis size.
1. Introduction

The problem of summarizing a data stream and estimating item frequencies has been studied for a long time in the database community. Instead of processing the massive raw data, it is much more convenient to use compact synopses that summarize the properties of the original data with less space cost. Many synopses such as sampling [9], wavelets [8,11], histograms [10] and sketches [5,28] have been proposed for data summarization. Sketches are widely used to summarize data and estimate item frequencies due to their small size. Among these sketches, the Count-Min sketch is the most widely used for frequency estimation.

The Count-Min sketch is a hashing-based structure consisting of a two-dimensional array. It uses different hash functions to map each item to positions in the corresponding rows of the array, and adds the count to each of these positions. The estimation is performed by taking the minimum count over the mapped positions. Due to hash collisions, the estimations are always over-estimations. Even though the Count-Min sketch is a sub-linear space structure and simple to implement, it yields a large error when collisions occur frequently. As long as the number of items is greater than the number of cells in each row of the array, there can be collisions, and the frequency of collisions increases with the number of items inserted into the Count-Min sketch. If a low-frequency item always conflicts with high-frequency items, its estimated frequency will be far greater than its true frequency.

In order to overcome this problem and reduce collisions, several approaches have been proposed. First, the most direct method is to construct a larger two-dimensional array for the Count-Min sketch [5], i.e., to use more hash functions and more cells in each row of the array. However, expanding the structure increases both the space cost and the response time, which hurts performance. Second, a number of methods attempt to separate the high-frequency items from the low-frequency items, since the collisions between them introduce a significant estimation error [13,28,46].
∗ Corresponding author. E-mail address: [email protected] (H. Wang).
However, these methods introduce new issues. The Augmented sketch, for example, adds an additional filter above the sketch. It increases the estimation accuracy of the high-frequency items in the filter while reducing the accuracy of the low-frequency ones in the backup Count-Min. In addition, such methods focus on querying the high-frequency items in the filter and leave a high relative error for the low-frequency items. Third, some methods such as CU [40] only increment the smallest counter for each insertion to reduce the estimation error. However, CU does not support deletion, and as a result it has not been as widely adopted in practice as the CM sketch [41].

Based on the above discussion, most prior methods focus on one or two aspects among accuracy, space cost and efficiency while introducing other problems. These problems motivated us to develop a new sketch that increases accuracy without expanding the space cost or reducing the efficiency. First, we need more compact structures that efficiently estimate item frequencies with little space cost. Second, in order to reduce the estimation error caused by collisions, we should separate the high-frequency items from the low-frequency ones and exclude the high-frequency items from the traditional sketches. Thus the collisions between the low-frequency and high-frequency items are avoided, and the low-frequency items will never be estimated as high-frequency ones.

To tackle the first challenge, we adopt a learning technique to obtain a compact regression model from a historical data distribution to predict frequencies for a part of the items. Such a regression model can efficiently predict frequencies with little space cost. The historical data can be captured by collecting frequency query results. If there is no historical data, we can regard the frequencies of the items in a small subset of the whole dataset as the historical data. The subset can be obtained by random sampling or simply by picking the first m elements of the dataset.

To tackle the second challenge, we use the learned regression model combined with frequency boundaries to distinguish the high-frequency items from the low-frequency items. In order to prevent the misclassification caused by the prediction error, we adjust the frequencies in the training set: we increase the high frequencies and reduce the low frequencies to increase their distances to the frequency boundary. Thus the frequency predictions of the high-frequency items have little chance to fall below the frequency boundary as long as the model is sufficiently accurate. That is, the high-frequency items are no longer misclassified as low-frequency items. Similarly, the low-frequency items also have little chance to be misclassified as high-frequency items. Thus the estimation accuracy increases.

We propose two learned sketches based on this idea. The first one is called the Learned Count-Min sketch (LCM), which is a combination of a model learned from a historical distribution and a backup traditional Count-Min sketch. The predictions from the learned model are regarded as the frequency estimations of the high-frequency items. The low-frequency items are inserted into the backup Count-Min sketch, from which their frequencies are estimated. The second sketch is called the Learned Augmented sketch (LAS). Inspired by the Augmented sketch, we add a filter to our Learned Count-Min sketch. We use the filter to store the item frequencies of the top-K high-frequency items.
The model and the backup Count-Min sketch are still used to handle the remaining high-frequency items and the low-frequency items, respectively. The improvement of the Learned Augmented sketch over the Learned Count-Min sketch is that the LAS increases the estimation accuracy for the heavy hitters in the filter.

The proposed learning-based sketches provide more accurate frequency estimations with more compact synopses. The model obtained by learning gives a reasonable prediction for the high-frequency items with little space cost. Since it is more difficult to capture the properties of the distribution of the low-frequency items, we simply store them in traditional sketches. The accuracy of the Count-Min sketch increases as the number of items in the Count-Min decreases. We can also reduce the size of the backup Count-Min sketch in order to reduce the total space cost without reducing the accuracy compared to the traditional sketches.

We make the following contributions in this paper.
• Our first contribution is the Learned Count-Min (LCM) sketch. We use a regression model to learn the frequencies of items from the historical data. The hypothesis presented in reference [15] indicates that a three-layer artificial neural network (ANN) can represent any function at any precision. We use the strategy of adding additional frequencies (hereinafter referred to as "offsets") to the frequencies in the training data to avoid misclassifying the high-frequency items and low-frequency items. The low-frequency items are stored in a backup Count-Min sketch. We prove that the LCM sketch can provide an estimation guarantee no worse than the traditional CM sketch with less space cost.
• We develop the Learned Augmented sketch (LAS). We add a filter storing the frequencies of the top-K frequent items on top of the LCM sketch. We also learn a model predicting the frequencies of items, which is used to separate the top-K items, the high-frequency items and the low-frequency items. The model provides reasonable over-estimations for the other high-frequency items (except the top-K). We still insert the low-frequency items into a backup Count-Min sketch.
• We conduct extensive experiments comparing our learned sketches with the traditional Count-Min and Augmented sketches. The experimental results demonstrate that our methods outperform the traditional Count-Min and Augmented sketches. Our learned sketches provide more accurate frequency estimations with more compact synopses.

The remaining sections of this paper are organized as follows. In Section 2, we survey related work. In Section 3, we review the preliminaries about the traditional Count-Min sketch and Augmented sketch. In Section 4 and Section 5, we introduce the Learned Count-Min sketch and the Learned Augmented sketch, respectively. In Section 6, we analyze extensions of our method. In Section 7, the experimental results demonstrate the performance of our algorithms. In Section 8, we provide conclusions and give a brief overview of our future work.
2. Related work

The issues of frequency estimation and streaming data summarization have been extensively studied. Frequency estimation can be applied in many areas such as databases, networking [22,36], and sensor analysis [33]. Synopses such as sampling, wavelets, histograms and sketches are widely used in this area [4]. Linear sketches are typically used for frequency estimation. A large number of sketches have been proposed for frequency estimation, such as Count sketches [6], Count-Min sketches [5], Augmented sketches [28], CU sketches [40], the Count-Min-Log sketch [27] and the Bias aware sketch [3]. There are many factors that measure the quality of a sketch, such as compactness, accuracy, efficiency and query support [3]. Thus, a good sketch should have a small size and efficiently provide estimations close to the accurate results. However, it is difficult to achieve all of these factors at the same time; usually, there is a trade-off among them.

The Count-Min sketch is the most typical and useful sketch [28,30]. It is widely adopted in many areas such as data analysis, natural language processing, query processing and optimization [4,24,31,44]. It can give approximations for many statistical queries such as point queries, range queries, and inner product queries. Its applications also include displaying the list of best selling items, the most clicked-on websites, the hottest queries on a search engine, and the most frequently occurring words in a large text [24,31,44]. The Count-Min sketch is a hashing-based structure consisting of a two-dimensional array. This sketch can also be viewed as a small-size counting version of a Bloom filter. The difference is that the Count-Min sketch uses more rows and stores counts in its cells. The significant advantage of the CM sketch is that it improves the space bounds of previous results [6] from 1/ε² to 1/ε and the time bounds from 1/ε² to 1. In addition, it supports many kinds of queries while most previous studies only support one kind of query [6,12].

Some other variants of the Count-Min sketch have been proposed in recent years. The CU sketch uses a structure similar to the Count-Min sketch, but it only increments the smallest counter in each update. It improves the query accuracy but does not support deletion. The Count-Min-Log sketch uses logarithm-based, approximate counters [7,35] instead of linear counters to improve the average relative error of the CM sketch. It improves the accuracy at the cost of suffering from both over-estimation and under-estimation errors. The Augmented sketch is a standard two-layer sketch. It adds an additional filter layer on top of the CM sketch. Filters play different roles in different systems and different areas [21,38]. The filter in the Augmented sketch is used to dynamically store the frequencies of the top-K items. It increases the estimation accuracy of the items in the filter, while introducing more error into the frequencies of the low-frequency items. Some recent sketches use structures similar to the Augmented sketch. For example, the Cold Filter [46] also uses a two-stage structure. It puts the cold items in the first stage and the hot items in the second stage. When a new item arrives, this sketch first attempts to insert it into the cold stage. For the top-K estimation problem, counter-based structures are more appropriate, and they perform faster than the sketches.
The counter-based structures only keep the estimations for the hot items, while the sketches usually give approximate counts for all the items. Some studies are based on Space-Saving [29], the main idea of which is to count only the occurrences of the frequent items. The Scoreboard Space-Saving (SSS) [13] was developed from Space-Saving. It uses a counting Bloom filter as the scoreboard to predict whether an item is hot. The score separates the items into three types: cold items, potential hot items and hot items.

Even though these variations improve the performance of the Count-Min sketch, most of them only focus on one or two aspects among accuracy, space cost and efficiency while bringing other problems. Our work considers all these aspects and attempts to build a compact sketch that efficiently estimates item frequencies with high accuracy. Most of these methods attempt to increase the accuracy by separating the high-frequency items from the low-frequency items, since the collisions between them introduce a significant error.

The idea of the Learned Index inspires us to take a new path to estimate frequencies and to classify the high-frequency and low-frequency items. In [19], the authors suggest that standard index structures and related structures, such as Bloom filters, could be improved by using machine learning methods. They propose a learned Bloom filter that can predict whether or not an item is a key, and use a backup Bloom filter to restore the false negative rate to 0. In reference [16], the authors proposed a learning-based frequency estimation approach. It uses a machine learning method to separate the high-frequency and low-frequency items. The high-frequency items are assigned to unique buckets, while the other items are assigned to traditional sketches such as the Count-Min and Count sketches. However, when dealing with big data, storing the high-frequency items in unique buckets still costs a lot of space. Motivated by this, we attempt to cut down the space cost by using a tiny model to predict the frequencies of a large number of items instead of storing them. Thus, the high-frequency items are excluded from the Count-Min sketch. The major difference is that the existing method focuses on increasing the accuracy of the estimations for sketches with sufficient space, while our method emphasizes raising the accuracy of sketches of a limited size. This property makes our method more suitable for big data.

There are some other recent research studies combining machine learning methods with data processing, data analysis and techniques in other areas. [1,14,37] use neural networks as hash functions to map a large dimensional space to a smaller one. In [17,43], the researchers try to model the cumulative distribution function for ranking and sorting. Reference [2] reported that rule-based fuzzy logic approaches such as [39] could be combined with ANNs for switching decision making. There are also some research studies mentioning the limitations of machine learning methods compared to traditional techniques [26,45]. Combining machine learning methods with other techniques is an interesting field, which still needs optimization.
Fig. 1. Count-Min sketch.
Fig. 2. Augmented sketch.
Even though many efforts have been made to apply machine learning methods to data analysis, ranking and indexing, machine learning methods have rarely been applied to improve the performance of frequency sketches. We recognize the advantages of a learned model, including its compactness and the efficiency of prediction. Therefore, we make use of them to construct more compact, more accurate and more efficient frequency sketches.
3. Preliminaries

We develop our learned sketches based on two existing sketches. In this section, we review them: the Count-Min sketch (CM) and the Augmented sketch (ASketch). Count-Min achieves the best update throughput in general, as well as a high accuracy on skewed distributions. The Augmented sketch is built on top of the Count-Min sketch; it adds an additional filter which is used to store the high-frequency items. The ASketch provides more accurate frequency estimations for the items in the filter.
3.1. Count-Min sketch

The Count-Min sketch is used to approximately maintain the counts of a large number of distinct items in a data stream, as shown in Fig. 1. A Count-Min sketch with parameters (ε, δ) is represented by a two-dimensional array of counts with width w and depth d, i.e., count[1,1], ..., count[d, w]. Given parameters (ε, δ), w = e/ε and d = ln(1/δ). The d hash functions h1, ..., hd : {1, ..., n} → {1, ..., w} are used to update the counts of different cells in this two-dimensional array. When an update(xi, c) arrives, meaning that the frequency of item xi is increased by a quantity c, c is added to one count in each row, and the sketch is updated as count[j, hj(xi)] ← count[j, hj(xi)] + c. The count of an item xi is estimated by min_{j∈[1,d]} count[j, hj(xi)]. Due to hash collisions, this is an over-estimation, i.e., âi ≥ ai, where ai is the count of the item xi and âi is the estimation. The estimation has an upper bound: with probability at least 1 − δ, âi ≤ ai + ε‖A‖₁, where ‖A‖₁ = Σ_{i=1}^{n} |ai| [5].

3.2. Augmented sketch

The Augmented sketch (ASketch) is built on top of the Count-Min sketch (CM). The CM sketch may provide an inaccurate count for the most frequent items and misclassify low-frequency items as high-frequency items. In order to solve these problems, the ASketch adds a pre-filtering stage to the sketch stage (Fig. 2). The filter retains the high-frequency items, and the sketch processes the tail of the distribution. This sketch improves the frequency approximation accuracy for the high-frequency items in the filter. In addition, the ASketch also reduces the collisions between the high-frequency and low-frequency items in the sketch stage.

There is a trade-off between the filter size and the accuracy of the frequency estimation. The accuracy for the high-frequency items obviously increases with the size of the filter. However, for the sake of keeping the same space cost, the sketch size should be reduced at the same time, which reduces the estimation accuracy of the low-frequency items stored in the sketch.
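As a concrete illustration of the update and point-query logic described above, the following is a minimal Python sketch of a Count-Min structure (not the authors' implementation); the per-row seeded hashing and the parameter values are illustrative assumptions.

```python
import math
import random

class CountMinSketch:
    """Minimal Count-Min sketch: w = ceil(e/eps) cells per row, d = ceil(ln(1/delta)) rows."""

    def __init__(self, eps=0.001, delta=1e-6, seed=42):
        self.w = math.ceil(math.e / eps)
        self.d = math.ceil(math.log(1.0 / delta))
        self.counts = [[0] * self.w for _ in range(self.d)]
        rng = random.Random(seed)
        # One hash seed per row; Python's built-in hash() salted per row plays the role of h_j.
        self.seeds = [rng.randrange(1 << 30) for _ in range(self.d)]

    def _hash(self, x, j):
        return hash((self.seeds[j], x)) % self.w

    def update(self, x, c=1):
        # Add c to one cell in every row (over-counting is possible, under-counting is not).
        for j in range(self.d):
            self.counts[j][self._hash(x, j)] += c

    def estimate(self, x):
        # Point query: the minimum over the d mapped cells.
        return min(self.counts[j][self._hash(x, j)] for j in range(self.d))

if __name__ == "__main__":
    cm = CountMinSketch(eps=0.01, delta=0.001)
    for item, freq in [("a", 10), ("b", 5), ("c", 1)]:
        cm.update(item, freq)
    print(cm.estimate("a"))  # >= 10; at this light load it is exactly 10
```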
Fig. 3. Learned Count-Min sketch.
4. Learned Count-Min sketch

The collisions between the high-frequency and low-frequency items are an important cause of the estimation error in the CM sketch. Since a low-frequency item can misleadingly appear as a high-frequency item, the collisions between them result in a significant estimation error for the low-frequency items [28]. The traditional CM sketch handles the high-frequency and low-frequency items in the same way; consequently, it cannot avoid the collisions between them. If we can learn the properties of the high-frequency items and low-frequency items, and train a model to distinguish between them, we may find ways to reduce the collisions between them. Based on this idea, we propose a new sketch called the Learned Count-Min sketch (LCM).

The framework of the LCM sketch is shown in Fig. 3. We first learn a model f from historical data that can predict the frequency of each item. We compute the boundary P separating the high-frequency and low-frequency items based on the historical data. Such a boundary P is determined according to the ratio of the total frequencies of the low-frequency items. We regard the items with frequencies higher than P as the high-frequency items, and those with frequencies lower than P as the low-frequency items. Once an item x arrives, the LCM predicts the frequency of x according to the model f. If the prediction f(x) < P, the item is regarded as a low-frequency item and inserted into the backup Count-Min sketch. The frequency estimation of a high-frequency item is the prediction f(x). The frequency estimation of a low-frequency item is calculated from the backup Count-Min sketch in the traditional way.

In the LCM, we want to keep the over-estimation property of the CM, since the one-sided error guarantee makes sure that no heavy hitter is missed. It also allows a simple calculation of the failure probability. In addition, the one-sided error bound benefits many applications using sketches [5]. However, it is difficult to ensure that the prediction from the model f is higher than the exact frequency. Therefore, we add reasonable positive offsets to the high-frequency items and negative offsets to the low-frequency items in the training set. This strategy has two benefits. First, it keeps the over-estimation property as discussed above. We add positive offsets to the high frequencies in the training set, so that the distance between a high frequency and the frequency boundary increases. Consequently, the model f returns an over-estimation for a high-frequency item as long as the prediction error is smaller than the offset added to the frequency. Since the low-frequency items are held in the backup Count-Min sketch, their estimations are still over-estimations. Second, it prevents the misclassification caused by the prediction error, as discussed in the introduction. The offsets increase the distance between the frequencies and the boundary. That is, the high-frequency items have little chance to be misclassified as low-frequency items as long as the prediction error is smaller than the offset. Similarly, the low-frequency items also have little chance to be misclassified as high-frequency items. In this way, the LCM sketch deals with the high-frequency and low-frequency items separately, leaving no chance for collisions between them.
In addition, since the model f only costs a little space, and the number of items in the backup CM also decreases, the LCM is a more compact sketch than the traditional Count-Min sketch. Based on the above discussion, the LCM is defined as follows.

Definition 1. A Learned Count-Min sketch (LCM) is composed of a frequency model f estimating the frequency of each item x by the prediction f(x), and a backup standard Count-Min sketch with parameters (ε, δ).

An LCM is constructed in two steps. The first step is training the frequency model f. The model can be learned from historical data. We use a boundary to separate the items into high-frequency items and low-frequency items according to the predictions. Note that the model is not only used as a filter to separate the high-frequency and low-frequency items, but can also be regarded as a frequency store for the high-frequency items. In the second step, we insert the low-frequency items into the Count-Min sketch. The details of the LCM construction are introduced along with the description of the pseudo code in Algorithm 1.

The pseudo code of learning the model f and constructing the LCM is shown in Algorithm 1. We first collect the training set X containing the items and Y containing the frequencies of the historical data (Line 3). We then compute a new frequency set YS containing the same elements as Y, but sorted in ascending order (Line 4). In order to separate the high-frequency items from the low-frequency ones, we use a threshold t to find the frequency boundary P (Line 5).
The threshold t = ‖Y_CM‖₁ / ‖Y‖₁, where ‖Y_CM‖₁ denotes the total frequencies of the items in the backup Count-Min, and ‖Y‖₁ denotes the total frequencies of all the items in Y. It is determined by the parameters (ε, δ) and the error bound ERROR_CM of the backup Count-Min sketch: t = ERROR_CM / (ε‖Y‖₁).
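For illustration, the boundary computation of Line 5 in Algorithm 1 below can be sketched as follows; the frequency list and the threshold value used in the example are illustrative assumptions.

```python
def frequency_boundary(frequencies, t):
    """Return P: the largest frequency such that the smallest frequencies at or below it
    account for at most a fraction t of the total frequency mass."""
    ys = sorted(frequencies)               # YS <- Sort(Y)
    total = sum(ys)                        # ||Y||_1
    budget = t * total
    running, boundary = 0, ys[0]
    for y in ys:
        running += y
        if running > budget:
            break
        boundary = y                       # P <- MAX{YS[p] | sum_{i<=p} YS[i] <= t*||Y||_1}
    return boundary

# Example with the historical frequencies of Table 1 and t = 0.2 (both illustrative):
print(frequency_boundary([1, 2, 3, 5, 10, 10, 5, 3, 2, 1], 0.2))  # -> 2
```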
Algorithm 1 Learned Count-Min sketch.
1: Input: TrainingSet, InputData, threshold t, CM parameters (ε, δ), θ
2: Output: P, model f
3: TrainingSet X → Y, X = {x1, x2, ..., xn}, Y = {y1, y2, ..., yn}, where each xi is a key and yi is the frequency of that key
4: YS ← Sort(Y)
5: P ← MAX{YS[p] | Σ_{i=1}^{p} YS[i] ≤ t·‖Y‖₁}
6: Y ← {y′i | y′i = yi + tanh(yi − P)·θ·ε‖Y‖₁, yi ∈ Y}
7: while one of Condition 1 and Condition 2 is not satisfied do
8:     Train(f, X, Y)
9: end while
10: Input (key, frequency)
11: if f(key) < P then
12:     update(key, frequency) in Count-Min (ε, δ)
13: end if

Table 1
An example dataset for CM and LCM.
Item:      1  2  3  4  5   6   7  8  9  10
Frequency: 1  2  3  5  10  10  5  3  2  1
Table 2
The CM sketch in the example.
h1(x) = (x + 4) mod 5:           11   7   6   7  11
h2(x) = (2x + 3) mod 7 mod 5:     8  20   4   5   5
We add an offset tanh(yi − P)·θ·ε‖Y‖₁ to each frequency in the training set (Line 6). θ is a constant parameter in (0, 1/2), tanh(yi − P) is in (−1, 1), and ε‖Y‖₁ is the upper error bound of the Count-Min. We set the parameter θ < t to make sure that the prediction error is not higher than the Count-Min estimation error. The factor tanh(yi − P) makes sure that the offset is positive for the high-frequency items (yi > P) and negative for the low-frequency items (yi < P). This step restores the one-sided error guarantee of the Count-Min sketch, which has many benefits as discussed above. Since a prediction from the model may be lower than the true frequency, we add positive offsets to the frequencies higher than P and negative offsets to the frequencies lower than P. In this way, the model keeps the over-estimation property as long as the prediction error is limited.

We use a 3-layer neural network to train the model f: X → Y until the following two conditions are satisfied (Lines 7-9).
Condition 1: ∀xi ∈ {xj | f(xj) ≥ P}, (f(xi) ≥ yi ≥ P) ∧ (f(xi) − yi ≤ θ·ε‖Y‖₁)
Condition 2: ∀xi ∈ {xj | f(xj) < P}, f(xi) < yi < P
These two conditions ensure that the model f returns over-estimations with limited error for the high-frequency items, and that the low-frequency items are not misclassified as high-frequency items.

After training the model, we use it to construct the backup Count-Min sketch from the incoming items. The low-frequency items are stored in the backup Count-Min sketch (Lines 10-13).

Algorithm 2 describes the process of using the LCM to estimate the frequency of an item x. We first obtain the prediction according to the model f. If f(x) ≥ P, meaning that x is a high-frequency item, the algorithm returns f(x) as the estimation result. Otherwise, the algorithm returns the estimation from the backup Count-Min.

Algorithm 2 Frequency estimation with Learned Count-Min sketch.
Input: Element x, frequency model f, backup CM (d = ln(1/δ))
Output: Frequency estimation of x
1: if f(x) ≥ P then
2:     return f(x)
3: else
4:     return min_{j∈[1,d]} count[j, hj(x)]
5: end if
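To make the control flow of Algorithms 1 and 2 concrete, the following is a minimal Python sketch of the LCM query path, reusing the CountMinSketch class from the Section 3 sketch; the dictionary standing in for the trained neural network and the example boundaries are illustrative assumptions.

```python
class LearnedCountMin:
    """Minimal LCM: a frequency model f, a boundary P, and a backup Count-Min.

    `model` is any callable x -> predicted frequency; in the paper it is a small
    3-layer neural network, here any stand-in (e.g. a dict lookup) works.
    """

    def __init__(self, model, boundary_p, eps=0.01, delta=0.001):
        self.f = model
        self.P = boundary_p
        self.backup = CountMinSketch(eps=eps, delta=delta)  # from the Section 3 sketch

    def insert(self, x, c=1):
        # Items predicted below the boundary go to the backup Count-Min (Lines 10-13 of Alg. 1).
        if self.f(x) < self.P:
            self.backup.update(x, c)

    def estimate(self, x):
        # Algorithm 2: model prediction for high-frequency items, backup CM otherwise.
        pred = self.f(x)
        return pred if pred >= self.P else self.backup.estimate(x)

# Illustrative usage with a dictionary standing in for the trained model:
predicted = {"hot": 120.0, "warm": 40.0, "cold": 2.0}
lcm = LearnedCountMin(lambda x: predicted.get(x, 0.0), boundary_p=50.0)
for item in ["hot"] * 100 + ["cold"] * 3:
    lcm.insert(item)
print(lcm.estimate("hot"), lcm.estimate("cold"))  # 120.0 (from the model), 3 (from the backup CM)
```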
We use the following example to illustrate the benefit of the Learned Count-Min sketch.

Example 4.1. We want to construct sketches for the data in Table 1. The Count-Min sketch with two hash functions is shown in Table 2. Suppose that the frequency boundary P in the LCM is 5 in this example. The frequencies of the items (4, 5, 6, 7) will then be predicted by the model. The frequencies of the remaining items (1, 2, 3, 8, 9, 10) will be stored in a smaller backup Count-Min sketch, shown in Table 3.
Table 3
The backup Count-Min in the LCM sketch in the example.
h1(x) = (x + 2) mod 3:          2   5   5
h2(x) = (2x + 1) mod 8 mod 3:   3   6   3
Fig. 4. Learned augmented sketch.
We attempt to estimate the frequency of "item 2". The estimation from the CM is min(h1(2), h2(2)) = min(7, 8) = 7. The estimation from the backup Count-Min in the LCM is min(h1(2), h2(2)) = min(5, 3) = 3. The estimation from the LCM is closer to the exact answer than that from the CM. The query item in the CM conflicts with other high-frequency items, so the estimation for the low-frequency item "2" is much higher than the exact answer. However, since the backup Count-Min in the LCM only stores the low-frequency items, the collision does not cause much error.

We use the following theorem to show that the Learned Count-Min sketch, in a smaller size, provides an accuracy guarantee no worse than that of the traditional Count-Min sketch.

Theorem 1. The Learned Count-Min sketch with a threshold t and a backup Count-Min sketch with parameters (ε/t, δ) returns an accuracy guarantee no worse than that provided by a standard Count-Min sketch with parameters (ε, δ), on the assumption that the items used for testing the model share the same distribution with the query items, and the guarantee of the model works not only on the test set but also on the query items.

Proof. A Count-Min (CM) sketch with parameters (ε, δ) returns the estimated frequency âi for an item xi with the guarantees: ai ≤ âi, and with probability at least 1 − δ, âi ≤ ai + ε‖A‖₁ [5]. The LCM sketch is composed of a model f and a backup Count-Min sketch with parameters (ε/t, δ). We prove the guarantee of the LCM sketch in the following two steps.

(1) According to the definition of the LCM, and on the assumption that the guarantee provided by the model f on the test set also holds for the query items in {xi | f(xi) ≥ P}, each point query key in {xi | f(xi) ≥ P} obtains the estimation âi = f(xi) returned by the model f with the accuracy guarantee: ai ≤ âi ≤ ai + 2θ·ε‖A‖₁.

(2) {xi | f(xi) < P} is the set of keys in a Count-Min sketch with parameters (ε/t, δ), and A_CM = {ai | f(xi) < P} denotes the frequencies of the items in the backup Count-Min sketch. According to Theorem 1 in [5], the backup Count-Min sketch returns the estimation âi with the accuracy guarantees: ai ≤ âi, and with probability at least 1 − δ, âi ≤ ai + (ε/t)·‖A_CM‖₁. Since (ε/t)·‖A_CM‖₁ = ε·(‖A_CM‖₁ / (t·‖A‖₁))·‖A‖₁ ≤ ε‖A‖₁ (because ‖A_CM‖₁ ≤ t·‖A‖₁), the backup Count-Min with parameters (ε/t, δ) achieves the same guarantee as a traditional Count-Min sketch with parameters (ε, δ).
That is, no matter whether the frequency of an item x is predicted by the model f(x) or estimated by the backup Count-Min with parameters (ε/t, δ), the upper error bound of the LCM is no worse than that of a traditional CM sketch with parameters (ε, δ). In addition, since w = e/ε, the backup Count-Min sketch reduces the space cost from w·d to w·d·t, where t ∈ (0, 1).

We can learn from Theorem 1 that the LCM is more compact and useful when the data distribution can be learned from the historical data. The high-frequency items are more likely to have statistical properties that can be modelled than the occasional low-frequency items that arrive incidentally. Therefore, we try to model the properties of the high-frequency items and leave the low-frequency items to the CM sketch.

5. Learned augmented sketch

The idea of the Learned Augmented sketch (LAS) is similar to that of the Learned Count-Min sketch. The LAS is composed of a filter, a model f and a traditional Count-Min sketch. The difference is that we add a filter to store the frequencies of the top-K frequent items. Thus, for the items in the filter, it provides more accurate estimations.

The framework of the Learned Augmented sketch is shown in Fig. 4. We separate the top-K items, the high-frequency items and the low-frequency items based on the learned frequency model and two boundaries.
The frequencies of the top-K and low-frequency items are stored in and estimated by the filter and the Count-Min, respectively. The frequencies of the remaining high-frequency items are directly estimated by the frequency model. Based on these discussions, the LAS is defined as follows.

Definition 2. The Learned Augmented sketch (LAS) is composed of a filter containing the top-K high-frequency items and their frequencies, a learned model f estimating the frequency of each item x, and a backup standard Count-Min sketch with parameters (ε, δ).

An LAS is constructed in two steps. The first step is training the frequency model f. The training process is similar to that of the LCM. In the second step, the top-K items are inserted into the filter and the low-frequency items are inserted into the backup Count-Min. The details of the LAS construction are introduced along with the description of the pseudo code in Algorithm 3.

Algorithm 3 Learned Augmented sketch.
1: Input: TrainingSet, InputData, threshold t, CM parameters (ε, δ), θ
2: Output: P, Ptop, model f, filter
3: TrainingSet X → Y, X = {x1, x2, ..., xn}, Y = {y1, y2, ..., yn}, where each xi is a key and yi is the frequency of that key
4: YS ← Sort(Y)
5: P ← MAX{YS[p] | Σ_{i=1}^{p} YS[i] ≤ t·‖Y‖₁}
6: Ptop ← YS[n − k + 1]
7: for each yi ∈ Y do
8:     if yi ≥ Ptop then
9:         y′i = yi + ε‖Y‖₁
10:    else
11:        y′i = yi + tanh(yi − P)·θ·ε‖Y‖₁
12:    end if
13:    Y′ = Y′ + {y′i}
14: end for
15: Ptop ← Ptop + 2θ·ε‖Y‖₁
16: while one of Condition 1, Condition 2 and Condition 3 is not satisfied do
17:     Train(f, X, Y′)
18: end while
19: Input (key, frequency)
20: if f(key) < P then
21:     update(key, frequency) in Count-Min (ε, δ)
22: else if f(key) > Ptop then
23:     if (key is in the filter) or (the filter is not full) then
24:         insert (key, frequency) into the filter
25:     else
26:         update(key, frequency) in Count-Min (ε, δ)
27:     end if
28: end if
The pseudo code of learning the model f and constructing the LAS is shown in Algorithm 3. We want to learn a model f mapping the keys to their frequencies. The frequency boundary P is obtained in the same way as in Algorithm 1 (Line 5). In order to distinguish the top-K items from the others, we add another boundary Ptop (Line 6). We still add offsets to the frequencies in the training set (Lines 7-14). We now explain the two offsets used in the construction procedure.

We add an offset ε‖Y‖₁ to the frequencies of the potential top-K items (Line 9). Since we accumulate the actual frequencies of the potential top-K items in the filter, the prediction accuracy of these items does not influence the final estimation result. We use the prediction only to separate the top-K items from the high-frequency items, so we add a bold offset to their frequencies. For the high-frequency items, we add an offset tanh(yi − P)·θ·ε‖Y‖₁ with the same motivation as in the LCM sketch (Line 11).

We then modify the boundary Ptop (Line 15). The reason is that the frequencies less than Ptop in Y have a chance to become larger than Ptop after adding the offset tanh(yi − P)·θ·ε‖Y‖₁. In order to prevent misclassifying the high-frequency items (excluding the top-K items) as top-K items, the boundary for the top-K items is changed to Ptop + 2θ·ε‖Y‖₁. 2θ·ε‖Y‖₁ is more than twice |tanh(yi − P)·θ·ε‖Y‖₁|. Therefore, the other high-frequency items have no chance to be misclassified as top-K items as long as the prediction error is smaller than the offset |tanh(yi − P)·θ·ε‖Y‖₁|. Since the offset added to the top-K items is much higher, there is still a gap between the boundary Ptop and the top-K frequencies. Therefore, the top-K items also have no chance to be misclassified as other kinds of items.
We also use a 3-layer artificial neural network to train the model f: X → Y until the following three conditions are satisfied (Lines 16-18).
Condition 1: ∀xi ∈ {xj | f(xj) ≥ Ptop}, (yi ≥ Ptop)
Condition 2: ∀xi ∈ {xj | P ≤ f(xj) < Ptop}, (P ≤ yi ≤ f(xi)) ∧ (f(xi) − yi ≤ θ·ε‖Y‖₁)
Condition 3: ∀xi ∈ {xj | f(xj) < P}, f(xi) < yi < P
The first condition makes sure that the top-K items are recognized without misclassification. The last two conditions play the same role as the two conditions in the LCM sketch. In this way, we distinguish the top-K items, the high-frequency items and the low-frequency items with only one feature, instead of adding another feature to label the classification. The frequencies of the top-K items are accumulated in the filter, and the low-frequency items are stored in the backup Count-Min sketch (Lines 19-28).

After constructing the LAS, it can be used to estimate item frequencies. Algorithm 4 describes the process of using the LAS to estimate the frequency of an item x. We first look up the query key x in the filter and obtain the estimation from the filter if it is there. If x is not in the filter, we obtain the prediction from the model f. If P ≤ f(x) < Ptop, the algorithm returns f(x) as a reasonable over-estimation. Otherwise, it returns the estimation from the backup Count-Min sketch.

Algorithm 4 Frequency estimation with Learned Augmented sketch.
Estimating the frequency of key x
1: if x is found in the filter then
2:     return count[x] from the filter
3: else if P ≤ f(x) < Ptop then
4:     return f(x)
5: else
6:     return min_j count[j, hj(x)]
7: end if
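A minimal Python sketch of the LAS insertion and query paths of Algorithms 3 and 4, again with a simple callable standing in for the trained model and the CountMinSketch class from the Section 3 sketch as the backup; the boundary values and the fixed-capacity filter policy are illustrative assumptions.

```python
class LearnedAugmentedSketch:
    """Minimal LAS: a bounded filter for top-K items, a frequency model, and a backup CM."""

    def __init__(self, model, boundary_p, boundary_top, filter_capacity, eps=0.01, delta=0.001):
        self.f = model
        self.P = boundary_p
        self.P_top = boundary_top
        self.capacity = filter_capacity
        self.filter = {}                                    # item -> exact accumulated count
        self.backup = CountMinSketch(eps=eps, delta=delta)  # from the Section 3 sketch

    def insert(self, x, c=1):
        pred = self.f(x)
        if pred < self.P:
            self.backup.update(x, c)                        # low-frequency item (Line 21)
        elif pred > self.P_top:
            if x in self.filter or len(self.filter) < self.capacity:
                self.filter[x] = self.filter.get(x, 0) + c  # top-K item (Line 24)
            else:
                self.backup.update(x, c)                    # filter is full (Line 26)
        # remaining high-frequency items are answered by the model alone; nothing is stored

    def estimate(self, x):
        if x in self.filter:
            return self.filter[x]              # exact count for top-K items
        pred = self.f(x)
        if self.P <= pred < self.P_top:
            return pred                        # model prediction for other high-frequency items
        return self.backup.estimate(x)         # backup Count-Min for low-frequency items
```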
The LAS adds a filter on top of the LCM sketch. The additional filter occupies some space. In order to keep the space cost the same as that of the LCM, the size of the backup Count-Min sketch in the LAS has to be reduced. Meanwhile, the estimation accuracy of the backup Count-Min sketch decreases with the reduction in size. We use the following theorem to demonstrate that the accuracy reduction caused by adding the filter is limited.

Theorem 2. Consider a Learned Count-Min (LCM) containing a backup Count-Min sketch with parameters (ε, δ), and a Learned Augmented sketch (LAS) with the same space cost as the LCM. Let |F| denote the filter size in the LAS. Then (1) the parameters of the backup Count-Min sketch in the LAS are (e/(e/ε − |F|/ln(1/δ)), δ), and (2) the raise âi in the frequency estimation error of the backup Count-Min is bounded by the following guarantee:

Pr[ âi > e · (N·|F|/(w·d) − Freq(F)) / (w − |F|/d) ] ≤ 1/e    (1)

where d = ln(1/δ), w = e/ε, |F| denotes the number of distinct items in the filter and Freq(F) denotes the total frequency of the items in the filter.

Proof. (1) We first calculate the parameters of the backup Count-Min in the LAS. Consider an LCM sketch whose backup Count-Min is constructed with parameters (ε, δ). The number of hash functions is d = ln(1/δ) and the number of cells in each row is w = e/ε. In order to maintain the same space cost, the width in the LAS is reduced to w′ = w − |F|/d = e/ε − |F|/ln(1/δ). The corresponding ε′ in the LAS is then ε′ = e/w′ = e/(e/ε − |F|/ln(1/δ)).

(2) We prove that the error increase of the backup Count-Min in the LAS is limited. In the Count-Min sketch with parameters (ε, δ), there are d = ln(1/δ) hash functions and w = e/ε cells in each hash-function range, so on average N/w counts are hashed into each cell. In the LAS, the range of each hash function is reduced by |F|/d. Since we do not exchange items between the filter and the backup Count-Min, the item frequencies in the filter, Freq(F), no longer remain in the Count-Min, so the remaining N − Freq(F) counts are spread over the w′ = w − |F|/d cells. The expected raise per cell, compared to the original width-w Count-Min, is therefore E(âi) = (N − Freq(F))/(w − |F|/d) − N/w = (N·|F|/(w·d) − Freq(F)) / (w − |F|/d). According to the Markov inequality,

Pr[ âi > e·E(âi) ] = Pr[ âi > e · (N·|F|/(w·d) − Freq(F)) / (w − |F|/d) ] ≤ 1/e.
Since |F| is much smaller than w·d, and Freq(F) occupies a large part of N, especially in skewed distributions, E(âi) will not be a large number. Theorem 2 indicates that adding the filter does not cause a significant accuracy reduction in the backup Count-Min sketch of the LAS. Thanks to the model f in our LAS, the accuracy reduction for the low-frequency items is limited. The reason is that the model makes reasonable estimations for a significant number of high-frequency items, so that fewer items are stored in the Count-Min sketch. Therefore, the estimation accuracy for the low-frequency items in the backup Count-Min sketch of the LAS does not decrease as much as that in the AS. The actual performance is shown in the experimental results of Section 7.

6. Extensions

In this section, we discuss extensions of our learned sketches. In the first part, we extend our learned sketches to support deletion. In the second part, we discuss strategies to accommodate our learned sketches to distributions with no evident properties. In the third part, we modify our frequency model to adapt to a dynamic data distribution.

6.1. Deletion (negative update) of items

Supporting deletion or negative updates makes a sketch more practical. NegativeUpdate(x, c) means reducing the frequency of an item x by c. We develop the NegativeUpdate(x, c) function to make our sketches support deletion and negative updates. We take the LCM as an example to show how it can be modified to support deletion. The way of enabling the LAS to support deletion is similar.

NegativeUpdate(x, c) in the LCM: (1) If the prediction f(x) < P, meaning that the item x is in the backup Count-Min, we simply adopt the traditional Count-Min method update(x, −c). (2) If the prediction f(x) ≥ P, meaning that the item x is directly estimated by the model f, we use a hash table Hdel with a limited size to store the item and its negative frequency (x, −c). The estimation process is modified accordingly: the estimation becomes f(x) + Hdel[x]. When the hash table is full, the model f needs to be retrained with the updated frequencies. The training set should include the items in Hdel, random high-frequency items and items in the backup Count-Min. After retraining the model, the hash table is emptied.

Owing to the differences between the LCM and the LAS, this method can also be applied to the LAS with a slight modification: we add a way to deal with the frequencies in the filter. It is easy to directly carry out NegativeUpdate(x, c) in the filter. The ways of implementing the negative update for the other high-frequency and low-frequency items are the same as those for the LCM.
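A minimal sketch of how NegativeUpdate(x, c) could be layered on the LearnedCountMin class from the Section 4 sketch, assuming the Count-Min counters may be decremented and using a bounded dictionary as the Hdel hash table; the capacity and the retraining hook are illustrative assumptions.

```python
class DeletableLCM(LearnedCountMin):
    """LCM extended with NegativeUpdate(x, c): deletions for model-estimated items
    are buffered in a bounded hash table H_del until the model is retrained."""

    def __init__(self, model, boundary_p, hdel_capacity=1024, **kwargs):
        super().__init__(model, boundary_p, **kwargs)
        self.h_del = {}
        self.hdel_capacity = hdel_capacity

    def negative_update(self, x, c):
        if self.f(x) < self.P:
            self.backup.update(x, -c)                 # case (1): item lives in the backup CM
        else:
            self.h_del[x] = self.h_del.get(x, 0) - c  # case (2): buffer the negative count
            if len(self.h_del) >= self.hdel_capacity:
                self._retrain()                       # placeholder for retraining the model

    def estimate(self, x):
        if self.f(x) >= self.P:
            return self.f(x) + self.h_del.get(x, 0)   # f(x) + H_del[x]
        return self.backup.estimate(x)

    def _retrain(self):
        # In the paper, the model is retrained from H_del, random high-frequency items
        # and items in the backup Count-Min; here we only empty the buffer.
        self.h_del.clear()
```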
6.2. Modeling distributions with no evident properties

It is usually easy to model a smooth continuous function with proper machine learning methods and proper parameters. However, it is difficult to model a non-smooth function. The learned model could be very accurate on the training data while returning low-accuracy estimations for the items in the test set due to over-fitting. In this situation, we consider separating the data into small parts. We can divide not only the range of items, but also the range of frequencies, to separate the data. We then try to model the distribution of each part and combine the models that work well on both the training and test data to predict the item frequencies. For the parts that are difficult to model, we simply insert their items into the backup Count-Min sketch. In this way, the learned methods still have a chance to save space as long as the data is not divided into too many parts, and each model is kept to a tiny size.

For example, suppose there is a set S of 10 pairs in the form (item, frequency): {(1,5), (2,6), (3,7), (4,6), (5,5), (6,12), (7,2), (8,13), (9,3), (10,16)}. The frequency distribution of the first five items is quite smooth, while the last five frequencies fluctuate more. We can learn a model for the first five items, and place the remaining items in the Count-Min sketch. Even though it is difficult to model all the item frequencies, we can still apply the idea of modelling a part of the items in order to reduce the required space.

6.3. Sketches with a dynamic frequency model

In the previous sections, we proposed two sketches based on a frequency model learned from historical data. In this section, we discuss a method to handle a dynamic data distribution, that is, the case in which the distribution of the incoming data differs from that of the historical data. The learned frequency model should then be modified to adapt to the changing distribution. Since the frequencies of the low-frequency items are maintained in the Count-Min sketch, the accuracy of their estimations is still under control due to the mechanism of the Count-Min sketch. However, the frequencies of the high-frequency items are not under supervision. Therefore, we add a monitor to supervise the change in the distribution of the items classified as high-frequency by the initial model.

Algorithm 5 describes the process of updating the learned sketch to adjust to the dynamic data distribution. We first regard the LCM as the learned sketch in this algorithm. The parameter γ in the Input is used to limit the sensitivity of the monitor to the change in the data distribution (Line 1).
Algorithm 5 Dynamic Learned sketch.
1: Input: γ ∈ (0, 1)
2: {Item1, Item2, ..., Itemk} ← k random elements from {x | f(x) > P}
3: Monitor = [(Item1, 0), (Item2, 0), ..., (Itemk, 0)]
4: while constructing the backup CM do
5:     update the frequencies of the items in the Monitor
6:     if MonitorTime() then
7:         Dis = KL(prob(Monitor) || prob(LCM.est(Monitor)))
8:         if Dis > γ then
9:             NewTrainSet ← Monitor
10:            NewTrainSet ← NewTrainSet + {(x, LCM.est(x)) | f(x) < P}
11:            fnew ← train f with NewTrainSet
12:        end if
13:    end if
14: end while
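A minimal sketch of the drift check in Lines 7-8 of Algorithm 5, assuming the two probability distributions are obtained by normalizing the monitored counts and the corresponding LCM estimates; the smoothing constant and the helper names are illustrative assumptions.

```python
import math

def kl_divergence(p_counts, q_counts, smooth=1e-9):
    """KL(P || Q) between two empirical distributions given as count lists."""
    p_total = sum(p_counts) + smooth * len(p_counts)
    q_total = sum(q_counts) + smooth * len(q_counts)
    dis = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = (pc + smooth) / p_total
        q = (qc + smooth) / q_total
        dis += p * math.log(p / q)
    return dis

def should_retrain(monitor, lcm, gamma):
    """Lines 7-8 of Algorithm 5: compare monitored true counts with the LCM estimates."""
    observed = [count for _, count in monitor.items()]   # monitor: dict item -> observed count
    estimated = [lcm.estimate(item) for item in monitor]
    return kl_divergence(observed, estimated) > gamma
```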
We get k random elements from the high-frequency items (Line 2), and initialize the Monitor with these k items and an initial frequency of 0 (Line 3). In the process of constructing the LCM with the incoming elements, the frequencies of the items in the Monitor are updated (Line 5). We periodically detect the change in the frequency distribution of the items in the monitor; MonitorTime() is the start condition of detecting that change (Line 6). We use the Kullback–Leibler divergence (KL) [20] to measure the difference between the probability distribution of the frequencies in the Monitor and that of the frequencies estimated by the current LCM sketch (Line 7). If the difference exceeds the threshold γ, then the items in the Monitor and the low-frequency items, as well as their frequencies, are added to the NewTrainSet for retraining the model (Lines 8-12). We use the LCM to represent the learned sketch in this algorithm. The LAS can be modified to a dynamic version in a similar way, and we do not repeat that here. Except for the way of getting random high-frequency items, the process of modifying the LAS is similar to that of the LCM.

The threshold γ in this algorithm is determined by the user. It can also be obtained with outlier detection methods; reference [34] is a good review of outlier detection. For instance, the simplest method, based on the standard deviation, is to regard a value two or three standard deviations away from the mean of the historical KL divergence as an outlier. The problems of retraining the model and finding an optimal threshold are not the main points of this study. The benefit of retraining the model is also difficult to measure in theory. We simply propose a method to handle this problem, and the experimental result in Section 7.6 indicates that it works. We will study these problems in the future.

7. Experimental results

In this section, we experimentally study the proposed algorithms. We compare the performance of the Learned Count-Min sketch (LCM) and the Learned Augmented sketch (LAS) with the Count-Min sketch (CM) and the Augmented sketch (AS), since they are the most relevant works to our sketches. We also make a simple comparison of the LCM and an existing learning-based sketch.

7.1. Experimental setup

We introduce the hardware, libraries, datasets, error metrics, the implementation of the frequency model, and some important parameters before we describe and analyze the experimental results.

Hardware and Library. All the experiments were conducted on a laptop with an Intel Core i5 CPU with a 2.60GHz clock frequency and 8GB of RAM. We use Keras, a Python deep learning library running on top of TensorFlow, to build the frequency models.

Datasets. We use both real and synthetic datasets with various distributions. The first synthetic dataset contains 10M records generated from the Normal distribution N(5000, 1500²). The second synthetic dataset contains 10M records generated from the skewed Zipf distribution, whose probability mass function is defined as f(x) = (1/x^α) / (Σ_{i=1}^{n} 1/i^α). The parameter α in this function is referred to as the Skewness in our experiments. There are 10,000 distinct items in the range [1, 10000] in this dataset. The third dataset is a real-life hourly dataset containing the PM2.5 data of the US Embassy in Beijing.¹ There are 43,824 instances in this dataset.
We take the attribute "PM2.5 concentration" to conduct our experiments, since the frequencies of the "PM2.5 concentration" values follow a typical real-life distribution in which most of the frequencies concentrate on a small range of values. The fourth dataset is a real-life dataset called WESAD (Wearable Stress and Affect Detection) [32]. This is a 16GB dataset containing 63 million records. We use the attribute called "RESPIRATION" in this dataset to conduct our experiments.

¹ Available from http://archive.ics.uci.edu/ml/datasets/Beijing+PM2.5+Data.
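For reproducibility of the synthetic streams described above, a small sketch of generating Zipf-distributed items with the stated probability mass function is given below; the use of numpy and the specific parameter values are illustrative assumptions.

```python
import numpy as np

def zipf_stream(n_items=10_000, n_records=10_000_000, alpha=1.0, seed=0):
    """Draw records from f(x) = (1/x^alpha) / sum_{i=1..n} (1/i^alpha) over items 1..n_items."""
    rng = np.random.default_rng(seed)
    ranks = np.arange(1, n_items + 1)
    pmf = 1.0 / ranks**alpha
    pmf /= pmf.sum()
    return rng.choice(ranks, size=n_records, p=pmf)

# Example: a smaller stream with skewness alpha = 1
stream = zipf_stream(n_items=10_000, n_records=100_000, alpha=1.0)
```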
Fig. 5. Tuning the number of neurons.
We use the first three datasets to test the performance of our learned sketches. The fourth dataset is used to compare the LCM we propose with the recent learned CM proposed in reference [16].

Error Metrics. We use two error metrics to measure the accuracy of each sketch. In the following definitions, Q, âi and ai denote the queries, the estimated frequency and the actual frequency, respectively.

Average Relative Error (ARE): We accumulate the relative error of the estimation for each key item in the queries and compute the average of these relative errors.
ARE = (1/|Q|) · Σ_{item_i ∈ Q} |âi − ai| / ai    (2)

Mean Squared Error (MSE): We accumulate the squared error of the estimation for each key item in the queries and compute the average of these squared errors.

MSE = (1/|Q|) · Σ_{item_i ∈ Q} |âi − ai|²    (3)
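The two metrics can be computed directly from paired lists of true and estimated frequencies; a minimal sketch (not the authors' evaluation code) follows.

```python
def are(actual, estimated):
    """Average Relative Error over the query set (Eq. (2))."""
    return sum(abs(e - a) / a for a, e in zip(actual, estimated)) / len(actual)

def mse(actual, estimated):
    """Mean Squared Error over the query set (Eq. (3))."""
    return sum((e - a) ** 2 for a, e in zip(actual, estimated)) / len(actual)

print(are([10, 5, 1], [12, 5, 3]), mse([10, 5, 1], [12, 5, 3]))  # 0.733..., 2.666...
```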
We evaluate the performance of our sketches LCM and LAS from the following four respects: accuracy with ARE and MSE, space with the consumed memory, efficiency with the throughput (the number of queries processed in one second), and the impact of the parameters on the accuracy with ARE and MSE.

Implementation Details. All the experiments are conducted in Python 3.5. The frequency models are built with Keras, a Python deep learning library running on top of TensorFlow. For each dataset, we train the frequency model with the item frequencies calculated from a sample whose sampling rate is 5%. We run each experiment 10 times and report the average performance.

We use a 3-layer artificial neural network with 1 hidden layer to model the relation between each element and its frequency. The hypothesis presented in reference [15] indicates that a three-layer artificial neural network (ANN) can represent any function at any precision. We set the activation function to ReLU [23], and set the optimizer to Adam [18]. ReLU is a widely used activation function and yields better results compared to Sigmoid and Tanh. Adam is an algorithm for first-order gradient-based optimization of stochastic objective functions. It is computationally efficient, requires little memory, and is well suited for problems that are large in terms of data. We use the default parameters of Adam provided in the original paper [18], since the authors state that the hyper-parameters require little tuning. In addition, the main point of this paper is not to tune a perfect model; therefore, we do not make further efforts to change these parameters. We use the mean squared error (MSE) as the loss function, since it is the most widely used loss in regression.

We use one hidden layer since we want to keep the model structure simple: too many layers and neurons would increase the space cost of storing the model. The number of neurons is determined by tuning. We attempted to find an appropriate parameter to accommodate the model to the three datasets used in our experiments. We compared the performance of models with one hidden layer containing different numbers of neurons on the three datasets, and the results are shown in Fig. 5. We also compared the performance of models with different numbers of hidden layers. Fig. 6 shows the performance of one hidden layer with 10, 20, 30, 40 and 50 neurons, and the performance of two hidden layers in the shapes of 5 × 5, 5 × 10 and 10 × 10 on the dataset with the normal distribution. We set the number of neurons to twenty, since it results in a good performance as shown in Figs. 5 and 6. A three-layer neural network with one hidden layer of twenty neurons is sufficient for the experiments in this study. We train the model for 2000 epochs. We conduct the training process 10 times for both the first dataset (normal distribution) and the second dataset (Zipf distribution). The average time of training a frequency model on the CPU is nearly 46 s; a GPU could accelerate the training. We could also use a more complex neural network to capture a more complex frequency distribution.

Some other machine learning methods, including random forests (RF) and Support Vector Machines (SVM), can also be used as the frequency model. However, the cost of training an SVM scales super-linearly with the number of examples. We did not adopt it, since the cost of the frequency model influences the entire space cost of the sketch; however, an SVM can be adopted for a small training set.
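A minimal Keras sketch of the frequency model described above (one hidden layer with twenty ReLU neurons, Adam, MSE loss); the input scaling, the toy training pairs (taken from set S in Section 6.2) and the epoch count shown here are illustrative assumptions rather than the authors' exact training script.

```python
import numpy as np
from tensorflow import keras

def build_frequency_model():
    # 3-layer ANN: input -> 20 ReLU neurons -> 1 output (predicted frequency).
    model = keras.Sequential([
        keras.layers.Dense(20, activation="relu", input_shape=(1,)),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# Illustrative training on (item, offset-adjusted frequency) pairs from historical data.
items = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
freqs = np.array([5.0, 6.0, 7.0, 6.0, 5.0])
model = build_frequency_model()
model.fit(items, freqs, epochs=2000, verbose=0)
pred = float(model.predict(np.array([[3.0]]), verbose=0)[0, 0])
```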
RF is an effective machine learning model.
Fig. 6. The impact of the number of layers and neurons on the space cost and accuracy.
Fig. 7. Frequency model based on different machine learning models.
ANN, SVM and RF are all valuable machine learning methods, and the best machine learning technique cannot be chosen a priori. Some works compare these methods [25,42]. Even though choosing the best model is not the point of this study, we still make a simple comparison of these three models on the datasets in the experiments. The comparison results are shown in Fig. 7. The space costs of the RF and the ANN in this experiment are nearly the same. Both the ANN and the RF perform well, and we choose the ANN for the following experiments.

Parameters. The parameters mentioned in our experiments are as follows.
Filter size: the number of items in the filter of the AS and LAS.
t: the ratio of the total frequencies of the low-frequency items stored in the backup Count-Min sketch to the total frequencies of all the items in the LCM and LAS.
θ: the coefficient of the offset added to the frequencies of the items for training the model in the LCM and LAS.
(ε, δ): the parameters of the Count-Min sketch. They determine the size of the two-dimensional array in the CM sketch: the width w = e/ε and the number of hash functions d = ln(1/δ).

7.2. Accuracy

We compare the four sketches, CM, LCM, AS and LAS, on the synthetic Normal distribution, the Zipf distribution (Skewness = 1) and the real PM2.5 distribution (short for "the distribution of the PM2.5 dataset") in the following experiments. For the Normal distribution and the Zipf distribution, we set the filter size to 1000, t = 0.2, θ = 0.01 and (ε = 10⁻³, δ = 10⁻⁶). For the PM2.5 distribution, we set the filter size to 50, t = 0.2, θ = 0.01 and (ε = 10⁻², δ = 10⁻³).

EXP1: Fig. 8 shows the performance of the Count-Min (CM) and Learned Count-Min (LCM) sketches. We run the CM sketch and the LCM sketch on the Normal distribution (Fig. 8(a)), the Zipf distribution (Fig. 8(b)) and the PM2.5 distribution (Fig. 8(c)). "ACT" in the figures means the actual frequencies of the data in the different distributions. We compare the estimated frequencies captured by these two sketches with the actual frequencies (ACT). We keep the space cost of the two sketches the same for the sake of controlling the variables. The figure indicates that our LCM sketch outperforms the Count-Min sketch on the Normal distribution, the Zipf distribution and the PM2.5 distribution.
Fig. 8. CM vs. LCM.
Fig. 9. AS vs. LAS.
This figure shows that our LCM sketch outperforms the Count-Min sketch on the Normal, Zipf and PM2.5 distributions. The reason is that the trained model can provide estimates for a large number of the key items in the queries, so that fewer items need to be stored in the backup Count-Min sketch. In addition, the model reduces the collisions between high-frequency and low-frequency items in the backup CM sketch, which increases the estimation accuracy for the low-frequency items.

EXP2: Fig. 9 shows the performance of the Augmented Sketch (AS) and the Learned Augmented Sketch (LAS). We run the AS and the LAS on the Normal distribution (Fig. 9(a)), the Zipf distribution (Fig. 9(b)) and the PM2.5 distribution (Fig. 9(c)), compare their frequency estimates with the actual frequencies (ACT), and again keep the space cost of the two sketches the same to control the variables. The figure shows that our LAS outperforms the AS on all three distributions; the reason is similar to that of the CM vs. LCM comparison analyzed above. It is noteworthy that the AS and LAS provide higher accuracy for the high-frequency items than the CM and LCM. Since we accumulate the true frequencies of the items in the filter of the LAS, it is more accurate than the AS. In the experiment on the PM2.5 distribution, the frequency estimates of the high-frequency items do not differ much across the four sketches; however, the LCM and LAS clearly perform better than the CM and AS when estimating the frequencies of the low-frequency items, as shown in Figs. 8(c) and 9(c).

EXP3: Fig. 10 shows the accuracy of these methods on datasets with different distributions. Fig. 10(a) and (b) report the ARE and MSE of the sketches on the Normal and Zipf distributions, and Fig. 10(c) and (d) report the ARE and MSE on the PM2.5 dataset. We separate the results for the PM2.5 distribution from the other two because the MSE of its estimates is much lower. The CM sketch is more accurate on skewed data such as the Zipf distribution. The AS is more accurate than the CM in these experiments, but Fig. 10(a) shows that the advantage of the AS over the CM is more evident on the Normal distribution than on the Zipf distribution. The reason is that the AS provides more accurate frequency estimates for the high-frequency items in the filter while reducing the accuracy for the low-frequency items; since most items in a Zipf distribution are low-frequency items, their higher relative errors lead to a high ARE. It is also clear that the ARE and MSE of the LAS and LCM are lower than those of the AS and CM. The improvement of the LAS over the LCM in Fig. 10(a) is not as evident as that in Fig. 10(b), for a reason similar to the above. However, the LAS does not suffer as much as the AS from the reduced accuracy on low-frequency items, because fewer items are placed in its backup Count-Min. We can also see from Fig. 10(c) and (d) that the LCM and LAS outperform the CM and AS on the real PM2.5 distribution.
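To make the (ε, δ) settings above concrete, the sketch below shows how they translate into the Count-Min array dimensions, together with a basic Count-Min structure and one plausible way a learned model can sit in front of it. The class and function names, the salted-hash rows, and the routing rule are illustrative assumptions, not the exact implementation used in our experiments.

```python
import math
import numpy as np

def cm_dimensions(epsilon, delta):
    """Standard Count-Min sizing: width w = e/epsilon, depth d = ln(1/delta)."""
    w = math.ceil(math.e / epsilon)        # counters per row
    d = math.ceil(math.log(1.0 / delta))   # number of hash functions (rows)
    return w, d

class CountMin:
    def __init__(self, epsilon, delta, seed=0):
        self.w, self.d = cm_dimensions(epsilon, delta)
        self.table = np.zeros((self.d, self.w), dtype=np.int64)
        rng = np.random.RandomState(seed)
        # Simple salted hashing per row; a real implementation would use
        # pairwise-independent hash functions.
        self.salts = rng.randint(1, 1 << 30, size=self.d)

    def _pos(self, row, key):
        return hash((int(self.salts[row]), key)) % self.w

    def update(self, key, count=1):
        for r in range(self.d):
            self.table[r, self._pos(r, key)] += count

    def query(self, key):
        return min(self.table[r, self._pos(r, key)] for r in range(self.d))

def lcm_style_query(key, f_model, backup_cm, backup_keys):
    """One plausible routing rule: low-frequency items kept in the backup CM
    are answered from it, all other items from the learned model.
    This is our illustration, not the authors' exact query path."""
    if key in backup_keys:
        return backup_cm.query(key)
    return float(f_model.predict([[key]])[0, 0])

# Example: epsilon = 1e-3, delta = 1e-6 (Normal/Zipf setting) gives
# w = ceil(e / 1e-3) = 2719 and d = ceil(ln(1e6)) = 14.
```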
Fig. 10. Accuracy comparison.
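For reference, the two error metrics reported in Fig. 10 can be computed as below. This assumes the standard definitions of the average relative error and the mean squared error over the queried items, which is our reading of the metrics rather than code quoted from the implementation.

```python
import numpy as np

def are(true_freqs, est_freqs):
    """Average relative error over queried items (true frequencies > 0)."""
    t = np.asarray(true_freqs, dtype=float)
    e = np.asarray(est_freqs, dtype=float)
    return float(np.mean(np.abs(e - t) / t))

def mse(true_freqs, est_freqs):
    """Mean squared error between estimated and true frequencies."""
    t = np.asarray(true_freqs, dtype=float)
    e = np.asarray(est_freqs, dtype=float)
    return float(np.mean((e - t) ** 2))
```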
7.3. Space

In this experiment, we evaluate the space consumption of the sketches. We configure the four sketches, CM, LCM, AS and LAS, with the same size, and use the Zipf distribution with skewness = 0.5. The sketch sizes are set to 48 K, 92 K and 184 K. For fairness, the filter and the model in each sketch are also taken into account: for the LCM, the synopsis size is |CM| + |f_model|, and for the LAS it is |CM| + |f_model| + |filter|. Fig. 11 shows the relation between the synopsis size and the estimation accuracy, again measured with ARE and MSE. We observe from Fig. 11(a) and (b) that the LCM and LAS reduce the required space without reducing the accuracy: the LCM and LAS at 48 K provide more accurate estimates than the CM and AS at 184 K. The accuracy of all four sketches improves as the sketch size grows. The CM and AS suffer from significant estimation error (in both ARE and MSE) when the sketches are small and collisions occur frequently. The LCM and LAS do not have this problem, since a large number of high-frequency items are excluded from the backup Count-Min.

7.4. Efficiency

In this experiment, we use the throughput of each sketch to measure its efficiency. Throughput here means the number of queries processed per second. Fig. 12 illustrates the impact of the skewness on the efficiency. We test the four sketches on Zipf distributions with the skewness varying from 0.5 to 1.5. The query keys in Fig. 12(a) are uniformly selected from the data range, meaning that each key in the range is selected with the same probability. The query keys in Fig. 12(b) are uniformly selected from the data, meaning that each key is selected with a probability proportional to its frequency. We can see from Fig. 12(a) that the distribution skewness has no impact on the efficiency of the CM and AS. There is no doubt that the CM is insensitive to the skewness. The throughput measurement itself can be sketched as shown below.
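The throughput numbers reported in Fig. 12 are queries answered per second. A minimal measurement harness is sketched below; the `sketch.query` interface and the key-selection helpers are illustrative assumptions rather than our exact benchmarking code.

```python
import time
import numpy as np

def measure_throughput(sketch, query_keys):
    """Return the number of point queries the sketch answers per second."""
    start = time.perf_counter()
    for key in query_keys:
        sketch.query(key)              # frequency estimate for one key
    elapsed = time.perf_counter() - start
    return len(query_keys) / elapsed

def keys_from_range(low, high, n, rng):
    """Fig. 12(a): keys drawn uniformly from the data range."""
    return rng.randint(low, high, size=n)

def keys_from_data(data, n, rng):
    """Fig. 12(b): keys drawn uniformly from the data itself, so each key
    is chosen with probability proportional to its frequency."""
    return rng.choice(data, size=n)

# Example: rng = np.random.RandomState(0)
```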
Fig. 11. The impact of synopsis size.
Fig. 12. Efficiency of query processing.
We therefore focus on the effect of the skewness on the AS, LCM and LAS. Since the query keys are uniformly selected from the data range, each item handled by the AS is queried with the same probability, no matter whether it resides in the filter or in the backup CM. That is, the skewness does not affect how often frequencies are estimated from the backup CM, and therefore it does not impact the throughput of the AS. We also observe from the figure that the throughputs of the LAS and LCM decrease with the skewness. To explain this phenomenon, recall the structure of the LCM and LAS: we set a threshold t and use t · |total_frequency| to limit the total frequency of the low-frequency items placed in the backup CM (a sketch of this partitioning is given below). As the skewness grows, the frequency of each low-frequency item decreases, so more items are stored in the backup Count-Min. Since each item is queried with the same probability, holding more items in the backup CM increases the number of queries answered from the backup Count-Min. Because estimating a frequency from the CM sketch is slower than predicting it with the learned model, an increase in the data skewness leads to a decrease in throughput. This explains the effect of the skewness on the throughput of the LCM and LAS. The figure also shows that the LAS, benefiting from its filter, is more efficient than the LCM.

We can observe from Fig. 12(b) that the distribution skewness has no impact on the efficiency of the CM and LCM. The CM is, without doubt, insensitive to the skewness no matter how the query keys are selected. For the LCM, the total frequency of the items stored in the backup Count-Min sketch is fixed by keeping the same threshold t, for the reason discussed in the previous paragraph. In this experiment, the query keys are uniformly selected from the data, so each key is selected with a probability proportional to its frequency. Since the total frequency of the items in the backup Count-Min of the LCM does not change with the skewness, the number of queries hitting the backup Count-Min does not change either. Moreover, since the model responds faster than the backup Count-Min, the throughput of the LCM is higher than that of the CM. The figure also indicates that the throughputs of the LAS and AS increase with the skewness, because the high-frequency items in the filter are queried more often than the low-frequency items in the backup Count-Min, and the response time of the filter is shorter than that of the CM. The LAS is still faster than the AS, because the LAS uses the learned model to predict the frequencies of part of the remaining items instead of querying all of them from the backup Count-Min. The superiority of the LAS over the AS decreases with the skewness, since more queries are answered from the filter when processing more skewed data, leaving little room for improvement on the few remaining items.
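The following sketch illustrates one way the threshold t can be used to decide which items go to the backup Count-Min: items are walked in increasing order of predicted frequency, and the lowest-frequency items are assigned to the backup structure until their cumulative frequency reaches t · |total_frequency|. This is our reading of the role of t, written with illustrative names; it is not the authors' exact partitioning code.

```python
def split_by_threshold(predicted_freqs, t):
    """Partition items into model-served and backup-CM-served sets.

    predicted_freqs: dict mapping item -> predicted frequency
    t: fraction of the total frequency allowed for the low-frequency
       items kept in the backup Count-Min.
    """
    total = sum(predicted_freqs.values())
    budget = t * total
    backup_items, model_items = set(), set()
    accumulated = 0.0
    # Walk items from the lowest predicted frequency upwards.
    for item, f in sorted(predicted_freqs.items(), key=lambda kv: kv[1]):
        if accumulated + f <= budget:
            backup_items.add(item)      # stored and counted in the backup CM
            accumulated += f
        else:
            model_items.add(item)       # answered by the learned model
    return model_items, backup_items

# Example: with t = 0.2, the low-frequency tail holding at most 20% of the
# total frequency is routed to the backup Count-Min.
```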
Fig. 13. The impact of skewness.
Fig. 14. The impact of t.
We conclude that both the skewness and the query distribution influence the throughput. The LCM is always faster than the CM, and the LAS is always faster than the AS. The throughput of the LCM is insensitive to the skewness when the query probability of an item corresponds to its frequency. The LAS combines the features of the AS and the LCM: on the one hand, it benefits from the learned model when the queried items are uniformly selected from the data range; on the other hand, it benefits from the filter when items are queried with a probability related to their frequencies.

7.5. The impact of parameters

We test the impact of the parameters on the performance in the following experiments. The query keys in these experiments are uniformly selected from the data range.

Skewness. Fig. 13 shows the impact of the skewness on the estimation accuracy. We use a Zipf distribution with a skewness that varies from 0.5 to 1.5 (a sketch of how such data can be generated is given below); the size of the LCM sketch is 92 K. The figures show that the AREs of the LCM and LAS increase with the skewness. The reason is that the number of low-frequency items increases with the skewness, so more items are stored in the backup Count-Min sketch when dealing with highly skewed data. When the skewness reaches 1.5, the MSEs of the four estimators are almost the same: even though increasing the skewness raises the relative estimation errors, it does not cause a high squared error, since the frequencies of the low-frequency items are quite small in a highly skewed dataset.

Threshold t. In this experiment, we evaluate the impact of the threshold t of the LCM sketch on the estimation accuracy. Recall that t is the fraction of the total frequency held by the low-frequency items stored in the backup Count-Min sketch. The experimental results are shown in Fig. 14. We use the Zipf distribution with skewness = 0.5, the size of the LCM sketch is 92 K, and t varies from 0.1 to 0.3. The figures show that the errors increase with t, meaning that the estimation accuracy decreases with t.
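The Zipf datasets with skewness between 0.5 and 1.5 can be generated, for example, by sampling keys with probabilities proportional to 1/rank^s. The sketch below is an illustrative generator under that assumption and is not necessarily the exact data generator used in our experiments.

```python
import numpy as np

def zipf_data(n_items, n_records, skewness, seed=0):
    """Draw n_records keys from {0, ..., n_items - 1} following a Zipf law
    with exponent `skewness` (probability proportional to 1 / rank^skewness)."""
    rng = np.random.RandomState(seed)
    ranks = np.arange(1, n_items + 1, dtype=float)
    probs = ranks ** (-skewness)
    probs /= probs.sum()
    return rng.choice(n_items, size=n_records, p=probs)

# Example: 10 million records over 10000 distinct keys with skewness 0.5.
# data = zipf_data(10000, 10000000, 0.5)
```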
Fig. 15. The impact of θ .
Fig. 16. The impact of retraining the model.
The reason is that the number of items stored in the backup Count-Min sketch increases with the threshold t, and inserting more items into the backup Count-Min sketch results in a higher error.

Accuracy parameter θ. In this experiment, we evaluate the impact of the parameter θ of the LCM sketch on the estimation accuracy. Recall that θ is the coefficient of the offset added to the frequencies when training the model. The experimental results are shown in Fig. 15. We use the Zipf distribution with skewness = 0.5, the size of the LCM sketch is 92 K, and θ varies from 0.01 to 0.03. The figures show that both ARE and MSE increase with θ, and that the relation between ARE and θ is linear. This coincides with the theoretical analysis, since θ introduces a linear relative error into the learned model.

7.6. Dynamic learned sketch

In Algorithm 5, we add a Monitor that supervises changes in the data distribution and triggers retraining of the frequency model in the learned sketches (a sketch of this monitoring logic is given below). In this experiment, we use a dataset that combines two datasets with different distributions and contains 10M records in total: the first half follows a normal distribution N(5000, 1500²), while the second half follows a uniform distribution over (0, 10000). The Monitor in this experiment includes twenty items. Fig. 16 shows the effect of retraining the model. Fig. 16(a) shows the impact of retraining on the KL divergence between the true frequencies and the estimates of the items in the Monitor, and Fig. 16(b) shows its impact on the accuracy of the estimates. The blue lines in the figures show the performance of the initial sketch without retraining: the KL divergence of the frequencies in the Monitor is relatively stable on the first half of the dataset, while the MSE of the estimates increases sharply on the second half. The threshold of the KL divergence used to decide when to retrain the model is γ = 0.04 (marked by the black dotted line in Fig. 16(a)). The red lines in the figures show the performance of the dynamic sketch under supervision.
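A minimal sketch of the monitoring step is given below: it compares the true frequencies of the monitored items with the sketch's estimates via the KL divergence and signals retraining when the divergence exceeds γ. The function names and the normalization of the two frequency vectors into distributions are illustrative assumptions; the exact Monitor logic is the one defined in Algorithm 5.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL divergence D(p || q) between two discrete distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def should_retrain(true_freqs, estimated_freqs, gamma=0.04):
    """Signal retraining when the divergence between the true and the
    estimated frequencies of the monitored items exceeds gamma."""
    return kl_divergence(true_freqs, estimated_freqs) > gamma
```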
Fig. 17. Comparison of learning-based methods.
the threshold γ = 0.04, the sketch is retrained as introduced in Algorithm 5. Fig. 16(b) shows that the MSE of the retrained sketch is lower than that of the initial sketch without retraining.

7.7. Comparison of learning-based methods

We compare the performance of our LCM (shown as "MyLCM" in Fig. 17(b)) with the "Learned CM" (shown as "PreviousLCM" in Fig. 17(b)) proposed in reference [30]; the difference between the two methods is discussed in Section 2. In this experiment, we compare the two sketches on the "WESAD" dataset, whose distribution is shown in Fig. 17(a), and evaluate their accuracy at different sketch sizes. Fig. 17(b) shows that the "PreviousLCM" is more accurate when the sketch is large enough, while our sketch outperforms the "PreviousLCM" when the sketch is limited to a small size. The reason is that the "PreviousLCM" provides unique buckets in which the heavy hitters store their true frequencies, whereas we do not store the frequencies of the high-frequency items at all. When the sketch is limited to a small size, the unique buckets of the "PreviousLCM" occupy a lot of space and thus reduce the space left for the Count-Min. "MyLCM", in contrast, avoids the space cost of storing a large number of high-frequency items and their frequencies, leaving more space for the Count-Min. Therefore, "MyLCM" is better when the space available for constructing the sketch is very small, which makes it attractive for big data. We must note, however, that our method cannot match the accuracy of the "PreviousLCM" when the space for the sketch is large enough: the error of the "PreviousLCM" approaches zero as long as sufficient space is provided, whereas the estimation error of our frequency model has little chance of approaching zero.

7.8. Summary of experimental results

We summarize the experimental results as follows:
• The Learned Count-Min and Learned Augmented sketches outperform the traditional Count-Min and Augmented sketches; the learned sketches provide more accurate estimates with a lower space requirement.
• Our Learned Augmented sketch with proper parameters estimates frequencies more efficiently than the CM and AS.
• The accuracy parameter θ and the threshold t adversely affect the estimation accuracy.
• The LAS is better than the LCM if the items in the filter are frequently queried, and slightly worse than the LCM if the low-frequency items are frequently queried.
• The superiority of our learned sketches is more evident when the sketch is limited to a small size.

8. Conclusion

In this paper, we propose two learned sketches, the Learned Count-Min sketch and the Learned Augmented sketch. We combine machine learning methods with the traditional sketches to improve their performance. The experimental results indicate that our learned sketches outperform the traditional CM sketch and Augmented sketch, providing more accurate estimates at a lower space cost. Combining machine learning methods with other techniques is an interesting and promising field, but it still needs optimization. In the next step of our study, we plan to improve and
perfect our sketches to automatically accommodate dynamic data distributions. We will also extend our methods and apply our idea to other synopses and more data analysis areas in our future study. Declaration of Competing Interest We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome. Acknowledgements This paper was partially supported by NSFC grant U1509216, U1866602, 61602129 and Microsoft Research Asia. References [1] N. Abdoun, S.E. Assad, M.A. Taha, R. Assaf, O. Déforges, M. Khalil, Hash function based on efficient chaotic neural network, in: 10th International Conference for Internet Technology and Secured Transactions, ICITST 2015, London, United Kingdom, December 14–16, 2015, 2015, pp. 32–37, doi:10. 1109/ICITST.2015.7412051. [2] A.M. Ahmed, O. Duran, Y.H. Zweiri, M. Smith, Hybrid spectral unmixing: using artificial neural networks for linear/non-linear switching, Remote Sens. 9 (8) (2017) 775, doi:10.3390/rs9080775. [3] J. Chen, Q. Zhang, Bias-aware sketches, PVLDB 10 (9) (2017) 961–972. [4] G. Cormode, M.N. Garofalakis, P.J. Haas, C. Jermaine, Synopses for massive data: Samples, histograms, wavelets, sketches, Found. Trends Databases 4 (1-3) (2012) 1–294, doi:10.1561/190 0 0 0 0 0 04. [5] G. Cormode, S. Muthukrishnan, An improved data stream summary: the count-min sketch and its applications, in: LATIN 2004: Theoretical Informatics, 6th Latin American Symposium, Buenos Aires, Argentina, April 5–8, 2004, Proceedings, 2004, pp. 29–38, doi:10.1007/978- 3- 540- 24698- 5_7. [6] C. Estan, G. Varghese, New directions in traffic measurement and accounting, Comput. Commun. Rev. 32 (1) (2002) 75, doi:10.1145/510726.510749. [7] P. Flajolet, Approximate counting: a detailed analysis, BIT 25 (1) (1985) 113–134. [8] M.N. Garofalakis, P.B. Gibbons, Wavelet synopses with error guarantees, in: Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, Madison, Wisconsin, USA, June 3–6, 2002, 2002, pp. 476–487, doi:10.1145/564691.564746. [9] R. Gemulla, W. Lehner, P.J. Haas, A dip in the reservoir: maintaining sample synopses of evolving datasets, in: Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, September 12–15, 2006, 2006, pp. 595–606. [10] P.B. Gibbons, Y. Matias, V. Poosala, Fast incremental maintenance of approximate histograms, ACM Trans. Database Syst. 27 (3) (2002) 261–298, doi:10. 1145/581751.581753. [11] A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M. Strauss, Surfing wavelets on streams: One-pass summaries for approximate aggregate queries, in: VLDB 2001, Proceedings of 27th International Conference on Very Large Data Bases, September 11–14, 2001, Roma, Italy, 2001, pp. 79–88. [12] A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M. Strauss, How to summarize the universe: dynamic maintenance of quantiles, in: VLDB 2002, Proceedings of 28th International Conference on Very Large Data Bases, August 20-23, 2002, Hong Kong, China, 2002, pp. 454–465. [13] J. Gong, D. Tian, D. Yang, T. Yang, T. Dai, B. Cui, X. Li, SSS: an accurate and fast algorithm for finding top-k hot items in data streams, in: 2018 IEEE International Conference on Big Data and Smart Computing, BigComp 2018, Shanghai, China, January 15–17, 2018, 2018, pp. 106–113, doi:10.1109/ BigComp.2018.0 0 024. [14] J. Guo, J. Li, CNN based hashing for image retrieval, 2015, arXiv:1509.01354. [15] K. Hornik, M.B. Stinchcombe, H. 
White, Multilayer feedforward networks are universal approximators, Neural Netw. 2 (5) (1989) 359–366, doi:10.1016/ 0893- 6080(89)90020- 8. [16] C. Hsu, P. Indyk, D. Katabi, A. Vakilian, Learning-based frequency estimation algorithms (2019). [17] J.C. Huang, B.J. Frey, Cumulative distribution networks and the derivative-sum-product algorithm, 2012, arXiv:1206.3259. [18] D.P. Kingma, J. Ba, Adam: a method for stochastic optimization, 2014, arXiv:1412.6980. [19] T. Kraska, A. Beutel, E.H. Chi, J. Dean, N. Polyzotis, The case for learned index structures, in: Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10–15, 2018, 2018, pp. 489–504, doi:10.1145/3183713.3196909. [20] S. Kullback, R.A. Leibler, On information and sufficiency, Ann. Math. Stat. 22 (1) (1951) 79–86. [21] J. Liu, C. Wu, Z. Wang, L. Wu, Reliable filter design for sensor networks using type-2 fuzzy framework, IEEE Trans. Ind. Inf. 13 (4) (2017) 1742–1752, doi:10.1109/TII.2017.2654323. [22] Z. Liu, A. Manousis, G. Vorsanger, V. Sekar, V. Braverman, One sketch to rule them all: rethinking network flow monitoring with univmon, in: Proceedings of the ACM SIGCOMM 2016 Conference, Florianopolis, Brazil, August 22–26, 2016, 2016, pp. 101–114, doi:10.1145/2934872.2934906. [23] V. Nair, G.E. Hinton, Rectified linear units improve restricted Boltzmann machines, in: Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21–24, 2010, Haifa, Israel, 2010, pp. 807–814. [24] S. Nath, P.B. Gibbons, S. Seshan, Z.R. Anderson, Synopsis diffusion for robust aggregation in sensor networks, TOSN 4 (2) (2008) 7:1–7:40, doi:10.1145/ 1340771.1340773. [25] R. Nijhawan, B. Raman, J. Das, Meta-classifier approach with ANN, SVM, rotation forest, and random forest for snow cover mapping, in: Proceedings of 2nd International Conference on Computer Vision & Image Processing - CVIP 2017, Roorkee, India, September 9–12, 2017, Volume 2, 2017, pp. 279–287, doi:10.1007/978- 981- 10- 7898- 9_23. [26] H.B. Pasandi, T. Nadeem, Challenges and limitations in automating the design of MAC protocols using machine-learning, in: International Conference on Artificial Intelligence in Information and Communication, ICAIIC 2019, Okinawa, Japan, February 11–13, 2019, 2019, pp. 107–112, doi:10.1109/ICAIIC. 2019.8669008. [27] G. Pitel, G. Fouquier, Count-min-log sketch: approximately counting with approximate counters, 2015, arXiv:1502.04885. [28] P. Roy, A. Khan, G. Alonso, Augmented sketch: faster and more accurate stream processing, in: Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, 2016, pp. 1449–1463, doi:10.1145/2882903.2882948. [29] P. Roy, J. Teubner, G. Alonso, Efficient frequent item counting in multi-core hardware, in: The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’12, Beijing, China, August 12–16, 2012, 2012, pp. 1451–1459, doi:10.1145/2339530.2339757. [30] F. Rusu, A. Dobra, Sketches for size of join estimation, ACM Trans. Database Syst. 33 (3) (2008) 15:1–15:46, doi:10.1145/1386118.1386121. [31] S.E. Schechter, C. Herley, M. Mitzenmacher, Popularity is everything: a new approach to protecting passwords from statistical-guessing attacks, 5th USENIX Workshop on Hot Topics in Security, HotSec’10, Washington, D.C., USA, August 10, 2010, 2010. [32] P. Schmidt, A. Reiss, R. Dürichen, C. Marberger, K.V. 
Laerhoven, Introducing WESAD, a multimodal dataset for wearable stress and affect detection, in: Proceedings of the 2018 on International Conference on Multimodal Interaction, ICMI 2018, Boulder, CO, USA, October 16-20, 2018, 2018, pp. 400–408, doi:10.1145/3242969.3242985. [33] C. Shen, Y. Li, Y. Chen, X. Guan, R.A. Maxion, Performance analysis of multi-motion sensor behavior for active smartphone authentication, IEEE Trans. Inf. Forensics Secur. 13 (1) (2018) 48–62, doi:10.1109/TIFS.2017.2737969.
[34] I. Souiden, Z. Brahmi, H. Toumi, A survey on outlier detection in the context of stream mining: review of existing approaches and recommadations, in: Intelligent Systems Design and Applications - 16th International Conference on Intelligent Systems Design and Applications (ISDA 2016) held in Porto, Portugal, December 16–18, 2016, 2016, pp. 372–383, doi:10.1007/978- 3- 319- 53480- 0_37. [35] R.H.M. Sr., Counting large numbers of events in small registers, Commun. ACM 21 (10) (1978) 840–842, doi:10.1145/359619.359627. [36] D. Tong, V.K. Prasanna, Sketch acceleration on FPGA and its applications in network anomaly detection, IEEE Trans. Parallel Distrib. Syst. 29 (4) (2018) 929–942, doi:10.1109/TPDS.2017.2766633. [37] J. Wang, H.T. Shen, J. Song, J. Ji, Hashing for similarity search: a survey, 2014, arXiv:1408.2927. [38] Y. Wei, J. Qiu, H.R. Karimi, W. Ji, A novel memory filtering design for semi-Markovian jump time-delay systems, IEEE Trans. Syst. Man Cybern. 48 (12) (2018) 2229–2241, doi:10.1109/TSMC.2017.2759900. [39] Y. Wei, J. Qiu, P. Shi, H. Lam, A new design of h-infinity piecewise filtering for discrete-time nonlinear time-varying delay systems via T-S fuzzy affine models, IEEE Trans. Syst. Man Cybern. 47 (8) (2017) 2034–2047, doi:10.1109/TSMC.2016.2598785. [40] S. Wu, H. Lin, L.H. U, Y. Gao, D. Lu, Finding frequent items in time decayed data streams, in: Web Technologies and Applications - 18th Asia-Pacific Web Conference, APWeb 2016, Suzhou, China, September 23–25, 2016. Proceedings, Part II, 2016, pp. 17–29, doi:10.1007/978- 3- 319- 45817- 5_2. [41] T. Yang, H. Zhang, H. Wang, M. Shahzad, X. Liu, Q. Xin, X. Li, FID-sketch: an accurate sketch to store frequencies in data streams, in: World Wide Web-internet & Web Information Systems, 2018, pp. 1–22. [42] H. Yuan, G. Yang, C. Li, Y. Wang, J. Liu, H. Yu, H. Feng, B. Xu, X. Zhao, X. Yang, Retrieving soybean leaf area index from unmanned aerial vehicle hyperspectral remote sensing: analysis of RF, ANN, and SVM regression models, Remote Sens. 9 (4) (2017) 309, doi:10.3390/rs9040309. [43] H. Zhao, Y. Luo, An O(N) sorting algorithm: Machine learning sorting, 2018, arXiv:1805.04272. [44] Q. Zhao, M. Ogihara, H. Wang, J.J. Xu, Finding global icebergs over distributed data sets, in: Proceedings of the Twenty-Fifth ACM SIGACT-SIGMODSIGART Symposium on Principles of Database Systems, June 26–28, 2006, Chicago, Illinois, USA, 2006, pp. 298–307, doi:10.1145/1142351.1142394. [45] Y. Zhao, Y. Shen, A. Bernard, C. Cachard, H. Liebgott, Evaluation and comparison of current biopsy needle localization and tracking methods using 3D ultrasound, Ultrasonics 73 (2017) 206–220. [46] Y. Zhou, T. Yang, J. Jiang, B. Cui, M. Yu, X. Li, S. Uhlig, Cold filter: a meta-framework for faster and more accurate stream processing, in: Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10-15, 2018, 2018, pp. 741–756, doi:10.1145/3183713.3183726.