Probabilistic Data Structures for Big Data Analytics: A Comprehensive Review

Amritpal Singh(a), Sahil Garg(a), Ravneet Kaur(a), Shalini Batra(a), Neeraj Kumar(a,*), Albert Y. Zomaya(b)

(a) Computer Science & Engineering Department, Thapar University, Patiala (Punjab), India.
(b) Centre for Distributed and High Performance Computing, The University of Sydney, Sydney, Australia.
Abstract

An exponential increase in data generation sources has been widely observed over the last decade, owing to the evolution of technologies such as cloud computing, IoT and social networking. This enormous and unbounded growth of data has led to a paradigm shift in storage and retrieval patterns, from traditional data structures to Probabilistic Data Structures (PDS). PDS are a group of data structures that are extremely useful for Big data and streaming applications because they avoid high-latency analytical processes. These data structures use hash functions to compactly represent a set of items in stream-based computing while providing approximations with error bounds, so that well-formed approximations are built into the data collections directly. Compared to traditional data structures, PDS use much less memory and answer complex queries in constant time. This paper provides a detailed discussion of the issues normally encountered with massive data sets, such as storage, retrieval and querying. Further, the role of PDS in solving these issues is discussed, where these data structures are used as temporary accumulators in query processing. Several variants of existing PDS, along with their application areas, are also explored, giving a holistic view of the domains where these data structures can be applied for efficient storage and retrieval of massive data sets. Mathematical proofs of various parameters considered in the PDS are also discussed, and the PDS are compared with respect to these parameters.

* Corresponding author.
Email addresses: [email protected] (Amritpal Singh), [email protected] (Sahil Garg), [email protected] (Ravneet Kaur), [email protected] (Shalini Batra), [email protected] (Neeraj Kumar), [email protected] (Albert Y. Zomaya)
Keywords: Big data, Internet of Things (IoT), Probabilistic Data Structures, Bloom filter, Quotient Filter, Count Min Sketch, HyperLogLog Counter, Min-Hash, Locality sensitive hashing
1. Introduction
Over the last few years there has been an exponential increase in data. The amount of data produced every day by different sources, such as IoT sensors and social networks like Twitter, Instagram and WhatsApp, has grown from terabytes to petabytes. This voluminous data growth, together with the need for efficient storage and retrieval, poses a big challenge for industry as well as academia [1]. To handle such large volumes of data, traditional algorithms cannot go beyond linear processing. Moreover, traditional approaches demand that the entire data set be stored in a formatted manner. These massive datasets require architectures and tools for data storage, processing, mining, handling and leveraging of the information to offer better services.

In the age of in-stream data [2] and the Internet of Things (IoT) [3], there is no limit on the amount of data coming from varied sources. Moreover, the complexity of the data and the amount of noise associated with it are not predefined. Since the size of the data is unknown, one cannot determine how much memory is required to store it. Furthermore, the amount of data to be analysed is in exabytes, which is too large to fit in the available memory, so linear processing and exact storage of the data are challenging. Thus, it is difficult to capture, store and process the incoming data within the stipulated time [4]. Data sets with such characteristics are typically referred to as Big data. Various definitions have been used to define Big data from different perspectives. Machine learning is used in a number of applications for optimization [5]. Further, the trend of traditional data mining is shifting towards more complex tasks, i.e., correlated utility-based pattern mining [6]. In this paper we define Big data's most relevant characteristics from a data analytics view, referred to as the 9 V's model. An illustrative description of these V's is depicted in Fig. 1.
Figure 1: Overview of Big data
Big data technologies are important in providing accurate analysis, leading to more concrete decision-making, greater operational efficiency, cost reductions and reduced risks for the business. To cope with Big data efficiently, new technologies appeared that enable distributed data storage and parallel data processing. Technologies from different vendors include MapReduce by Google, which provides a method of analyzing data that can be scaled from single servers to thousands of high- and low-end machines; NoSQL Big data systems, which are designed to take advantage of cloud computing architectures so that massive computations can be run inexpensively and efficiently; and cloud platforms such as Amazon AWS and Microsoft Azure, which provide various tools to handle Big data. Along with the above mentioned technologies, Apache Hadoop (with its HDFS and MapReduce components) was a pioneering technology. Hadoop, developed by Apache, is an open source tool, and the most commonly used "Hadoop MapReduce" combines Google's MapReduce model with Hadoop. Hadoop is a package of many components, which include Apache Hive, an infrastructure for data warehousing; Apache Oozie, for scheduling Hadoop jobs; Apache Pig, a data flow platform responsible for the execution of MapReduce jobs; and Apache Spark, an open source framework used for cluster computing. Although Hadoop provides an overall package for Big data analytics that requires little technical background to operate, some issues in Hadoop still need optimized solutions. In Hadoop, MapReduce processes large data sets with a parallel and distributed algorithm; data is distributed and processed over the cluster, which increases the processing time and decreases the processing speed. Further, Hadoop supports batch processing only and does not process streamed data, hence overall performance is slower (Apache Spark supports stream processing). Another major issue is that Hadoop's programming model is quite restrictive, which prevents easy modification of the built-in algorithms. The efficient analysis of in-stream data often requires powerful tools such as Apache Spark, Google BigQuery, High-Performance Computing Cluster (HPCC), etc. However, these tools are not suitable for real-time use cases where a fast response is required, such as processing data in a specific application domain or implementing interactive jobs and models. Recent research directions in the area of Big data processing, analysis and visualization clearly indicate the importance of Probabilistic Data Structures (PDS).
The use of deterministic data structures to analyse in-stream data often entails considerable computational, space and time complexity. Probabilistic alternatives to deterministic data structures, i.e., Probabilistic Data Structures (PDS), are better in terms of simplicity and of the constant factors involved in the actual run time. They are suitable for large-scale data processing, approximate predictions, fast retrieval and the storage of unstructured data, and thus play an important role in Big data processing.

PDS are, tautologically speaking, data structures having a probabilistic component [7]. This probabilistic component is used to trade accuracy for time or space. PDS cannot give a definite answer; instead they provide a reasonable approximation of the answer together with a way to estimate the error of this approximation. They are useful for Big data and streaming applications because they decrease the amount of memory needed in comparison to data structures that give exact answers [8]. Different variants of PDS are highlighted in Fig. 2. In the majority of cases, these data structures use hash functions to randomize the items; because they ignore collisions, they keep their size constant, but this is also the reason why they cannot return exact values. Moreover, PDS offer several advantages, given below:

• They use a small amount of memory (and one can control how much).
• They are easily parallelizable (the hashes are independent).
• They have constant query time.

Figure 2: Overview of PDS

The major focus of this paper is on the role of Probabilistic Data Structures (PDS) in the following scenarios:

• Approximate Membership Query: store bulk data in a small space S and respond to a user's membership queries efficiently within that space.
• Frequency Count: count the number of times a data item has arrived in the massive data set.
• Cardinality Estimate: find the cardinality, i.e., the number of distinct members of a set, in the massive data set.
• Similarity Search: identify similar items, i.e., find the approximate nearest neighbors (most similar items) to the query in the available dataset.

Organization of the paper: Section II provides a detailed discussion of the approximate membership query using the most frequently used PDS, the Bloom Filter (BF), and its variant the Quotient Filter (QF). Section III discusses how the frequency count problem is solved efficiently by the PDS named the Count-Min Sketch (CMS). Section IV provides an insight into cardinality estimation using the HyperLogLog (HLL) counter, along with a relative comparison and review of its variants. Section V discusses the PDS used for similarity search over massive Big data and provides a detailed discussion of Min-Hash and the family of Locality Sensitive Hashing (LSH) (the various families of LSH, based on the distance metric used, are discussed in Appendix 1). Section VI summarizes the role of all the above mentioned PDS with respect to various parameters. Finally, Section VII concludes the paper.
2. Approximate Membership Query
Given millions or even billions of data elements, developing efficient solutions for storing, updating and querying them becomes difficult, especially when the data is queried by a real-time application. Traditional database approaches, which perform filtering and analysis after storing the data, are not efficient for real-time processing. Since the data is bulky and requires large data structures, the retrieval cost of even a small query is very high. These issues clearly indicate that efficient storage and searching techniques are required for processing Big data. In applications where efficiency is more important than accuracy, probabilistic approaches and approximation algorithms can serve as a key ingredient of data processing. Thus, the membership query problem is converted to an approximate membership query, where probabilistic results are acceptable.

Definition 1: (Membership Query) For a given set S = {x1, x2, ..., xN} with N items, a membership query confirms the presence of a queried element xq in the set using deterministic approaches. The result of a membership query is binary, i.e., 1 indicates xq ∈ S and 0 indicates xq ∉ S. The space and computational costs of these approaches depend on the size of the dataset considered.

Definition 2: (Approximate Membership Query (AMQ)) For a given set S = {x1, x2, ..., xN} with N items, an AMQ checks the presence of a queried element xq in the set using some approximation or probabilistic approach for fast results. The query returns results with some approximation: it reports that xq is "possibly in the set" or "definitely not in the set". The query complexity is independent of the size of the dataset and the space complexity is significantly reduced.

2.1. Bloom Filters

Storing bulk data in a small space S and querying items within that space can be accomplished using the BF, a randomized data structure that supports set membership queries.
The Bloom Filter (BF) [9], a space efficient probabilistic data structure, is used to represent a set S ⊂ U of n elements and supports approximate membership testing. It consists of an array of m bits, denoted BF[1, 2, ..., m], with all bits initially set to 0. The filter uses k independent hash functions h_j, ∀j (1 ≤ j ≤ k), each of which maps an element of U uniformly onto a position in {1, ..., m}. To insert an element x ∈ S, the bits at positions h_j(x) are set to 1, ∀j (1 ≤ j ≤ k).
Figure 3: Insertion in Bloom filter
Given an item y_i ∈ Q, where Q is the set of query elements, its membership is checked by examining the bits of the BF[1...m] array corresponding to the hash functions. If all hash positions h_j(y_i), 1 ≤ j ≤ k, are set to 1, then y_i is considered to be part of S, otherwise not. This space efficient representation comes at the cost of false positives, i.e., elements can be erroneously reported as members of the set although they are not. In practice, the huge space savings often outweigh the false positives if they are kept at a sufficiently low rate. Given a BF with m bits and k hash functions, the insertion and membership query time complexity is always O(k).
Figure 4: Querying in Bloom filter
The detailed working of the insertion and querying operations is illustrated in Figs. 3 and 4.
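To make the insertion and query operations concrete, the following minimal Python sketch mirrors Figs. 3 and 4: it keeps an m-bit array and derives k positions per item from a salted SHA-1 hash. The class name, the hashing scheme and the parameter values are illustrative assumptions, not the construction of [9].

    import hashlib

    class BloomFilter:
        """Minimal Bloom filter sketch: m bits, k hash functions (illustrative only)."""

        def __init__(self, m, k):
            self.m, self.k = m, k
            self.bits = [0] * m           # bit array BF[0..m-1], all zeros initially

        def _positions(self, item):
            # Derive k positions by hashing "item:j" with SHA-1 (an assumed hashing scheme).
            for j in range(self.k):
                h = hashlib.sha1(f"{item}:{j}".encode()).hexdigest()
                yield int(h, 16) % self.m

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos] = 1        # set every hashed position to 1

        def __contains__(self, item):
            # "Possibly in set" only if all k positions are 1; any 0 means "definitely not".
            return all(self.bits[pos] for pos in self._positions(item))

    bf = BloomFilter(m=1024, k=5)
    bf.add("10.0.0.1")
    print("10.0.0.1" in bf)   # True  (no false negatives)
    print("10.0.0.2" in bf)   # False with high probability (false positives possible)

Because bits are only ever set, membership tests can return false positives but never false negatives, which is exactly the behaviour analysed below.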
To derive the false positive probability of a BF, it is assumed that the hash functions used are universal random functions, i.e., each BF bit is selected with equal probability [9, 10].
When inserting an element into the filter, the probability that a certain bit is not set to one by a single hash function is 1 − 1/m. Since k hash functions are used, the probability that none of them sets the specific bit to one is (1 − 1/m)^k. After performing n insertions in the BF, the probability that a given bit is still zero is:

(1 − 1/m)^{kn}    (1)

Consequently, the probability that the bit is one is:

1 − (1 − 1/m)^{kn}    (2)
In the querying process, if all the hash positions corresponding to the hash functions h_j, ∀j (1 ≤ j ≤ k), are set to 1, the element is assumed to belong to the set.
The probability of a false positive, i.e., the probability that an element is not part of the set yet the BF claims that it is, is therefore:

(1 − (1 − 1/m)^{kn})^k    (3)

Using the approximation (1 − 1/m)^{kn} ≈ e^{−kn/m}, it can be concluded that:

(1 − (1 − 1/m)^{kn})^k ≈ (1 − e^{−kn/m})^k    (4)

Thus the false positive rate of a BF is given by:

f_p = (1 − e^{−kn/m})^k    (5)
Example: To illustrate the above derivation of false positives in a BF, assume a BF of size 10^9 bits needs to accommodate 10^8 elements using 5 hash functions. The false positive probability for this scenario is obtained as follows:

1. The probability that, while inserting an element, a particular hash function sets a given bit is 1/10^9.
2. The probability that the bit is not set by that hash function is 1 − 1/10^9 = 0.999999999.
3. Using 5 hash functions, the probability that the bit is still not set after one insertion is (1 − 1/10^9)^5 ≈ 0.999999995.
4. After inserting 10^8 elements in the BF, the probability that a given bit has been set to one is 1 − (1 − 1/10^9)^{5×10^8} ≈ 0.393469332.
5. A query in the BF checks all hash positions, so the false positive probability is (1 − (1 − 1/10^9)^{5×10^8})^5 ≈ 0.009430928.
The false positive probability f_p decreases as the size of the BF (m) increases, and increases as the number of elements grows. Using more hash functions decreases the probability of collision up to an optimum, and the user can predefine the acceptable false positive rate according to the application's requirements. The accuracy of a BF thus depends on the filter size m, the number of hash functions k, and the number of elements n. To minimize f_p with respect to k, the optimal value k_opt is given as follows:

k_opt = (m/n) ln 2    (6)
The space advantage of the BF depends on the error rate acceptable for the application considered. To maintain a fixed false positive probability p with n elements, the required size m of the BF is given by:

m = −(n · ln p) / (ln 2)^2    (7)
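As a quick numerical check of Eqs. (5)-(7), the following short sketch (in the same illustrative Python style used above) reproduces the false positive rate of the worked example (m = 10^9, n = 10^8, k = 5) and evaluates the optimal k and the required m for a target error p; the specific numbers are examples only.

    from math import exp, log, ceil

    def false_positive_rate(m, n, k):
        # Eq. (5): fp = (1 - e^(-kn/m))^k
        return (1.0 - exp(-k * n / m)) ** k

    def optimal_k(m, n):
        # Eq. (6): k_opt = (m/n) ln 2
        return (m / n) * log(2)

    def required_bits(n, p):
        # Eq. (7): m = -n ln(p) / (ln 2)^2
        return ceil(-n * log(p) / (log(2) ** 2))

    print(false_positive_rate(m=10**9, n=10**8, k=5))   # ~0.0094, matching the example above
    print(optimal_k(m=10**9, n=10**8))                  # ~6.93 hash functions
    print(required_bits(n=10**8, p=0.01))               # ~9.6 * 10^8 bits for a 1% error target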
2.1.1. Categories of Bloom Filter (BF)
Based upon the application domains, many variants of BF have been proposed, which can be broadly classified into four categories.

• Static Bloom Filter (SBF): a BF with a fixed array size is known as an SBF. This type of BF has a constant false positive rate and works efficiently on static data sets. Based on the number of elements (n), parameters such as the size of the BF (m) and the number of hash functions (k) can be decided. Another essential determinant is the potential range of the elements to be inserted; if it is limited, a deterministic bit vector can do better. Distance-sensitive BF [11], weighted BF [12], etc., are some of the BFs which fall under this category. Problems associated with the SBF are that the collision rate increases exponentially as the amount of incoming data grows, and the deletion operation is not allowed: a bit which was set to one earlier would be set to zero by a deletion, but this would also unmark other elements which had marked that bit. Standard BFs are suitable for representing static Big data whose size is known in advance, i.e., it does not vary with time.

• Counting Bloom Filter (CBF): the CBF, introduced by Fan et al. [13], uses a counter array instead of a bit array. Each position in the array is a counter, allowing both insertion and deletion operations on the CBF. Whenever an element is added to or deleted from the CBF, the corresponding counters are incremented or decremented respectively, e.g., the d-left CBF [14]. Although it can be used efficiently in applications where deletion is required, it increases the memory overhead by a large factor, and choosing the counter width is a cumbersome process.

• Incremental Bloom Filter (IBF): BFs which are adaptive in nature, i.e., change their size according to the incoming data, fall into this category. The basic idea of incremental BFs is to represent a dynamic set D with a dynamic bit matrix and accommodate incoming data by adding new filters at runtime. If a rough estimate of the number of elements to be inserted is not available, then a hash table or an incremental BF is the better option. Dynamic BFs [15], scalable BFs [16], etc., belong to this category. However, the major drawback of these BFs is that the query complexity increases as the size increases. The initial filter size is an important factor in such cases: assigning a small initial array leads to computational, slice-addition and query-complexity overheads, while a large initial size may lead to memory wastage.

• Ageing Bloom Filter (ABF): some network applications require high-speed processing of packets. For this purpose, the BF array should reside in a fast and small memory. In such cases, due to the limited memory size, stale data in the BF needs to be deleted to make space for new data. The answer to such applications is the Ageing BF, which works similarly to a Least Recently Used (LRU) cache. Stable BF [17], A2 buffering, double buffering [18], etc., are examples of ABF. In A2 buffering, two filters are used: initially data is inserted into the first filter; once its threshold is exceeded, data is inserted into the second filter, and as soon as the threshold of the second BF is crossed, data is evicted from the first filter, and the process continues. The advantage of this approach is that data can be stored for a longer time by using double the memory of the simple buffering approach. The major flaw of such filters is that the filter size is static, so an approximate estimate of the incoming data is required in advance to decide the filter size. Moreover, these filters sometimes report false
negative results.

2.1.2. Applications

Initially, the BF was used to represent words in a dictionary. Gradually, it came to be widely used in many networking and security algorithms, such as authentication, IP trace-back, string matching and replay protection [10]. Presently it is used in fields as diverse as accounting, monitoring, load balancing, policy enforcement, routing, clustering and network security [19]. Cellular networks use a BF based approach in device-to-device communication to identify mobile applications [20]. The BF is applied in VANET applications and cloud platforms for DDoS attack prevention [21] and privacy preservation [22]. Recently, a BF has been implemented for the Controller Area Network (CAN) for efficient intrusion detection, preventing various replay and modification attacks [23]. A solution to the hot spot tracking problem for streaming data has been proposed using Time-decaying Bloom Filters (TBF) and online sampling technology [24]. Bloom filters are also being used in the field of bioinformatics to classify DNA sequences. A solution based on Multiple Bloom Filters (MBFs) has been implemented for pattern matching in DNA sequencing data, which optimizes the search for the location and frequency of a specified pattern [25].

Industrial applications: technology giants also use the BF and its variants in different fields. Quora uses a shared BF in the feed back-end to filter out stories that people have seen before. Facebook performs type-ahead search, fetching friends and friends of friends matching a user-typed query, using a BF; BFs are also used to avoid redirecting users to malicious websites. Oracle uses BFs to perform Bloom pruning of partitions for certain queries. Apache HBase uses BFs to boost read speed by filtering out unnecessary disk reads of HFile blocks which do not contain a particular row or column [26]. A few concerns related to the BF remain: it is not possible to retrieve the original key after hashing, and there is a probability of false positives, as well as of false negatives in some variants which support deletion. Many variants, such as the Fuzzy-folded Bloom Filter [27], ID Bloom Filter [28], Ultra-Fast Bloom Filter [29], r-Dimensional Bloom Filter [30] and Magic Cube Bloom Filter (MCBF) [31], have been designed and discussed in the literature, based on various applications and their requirements. Table 1 provides a detailed literature review of variants of the BF along with their application domains. These variants allow end users to choose a BF variant based upon their usage in different applications.
Table 1: Variants of Bloom Filter (SBF: Standard Bloom Filter, CBF: Counting Bloom Filter, IBF: Incremental Bloom Filter, ABF: Ageing Bloom Filter)

1. Bloom Filter, Burton H. Bloom [9] (SBF): compact probabilistic representation of a set; search in databases and dictionaries.
2. Counting Bloom Filter, Fan et al. [13] (CBF): answers frequency queries by using counters instead of simple bits; used in conjunction with web caches.
3. Compressed Bloom Filter, Michael Mitzenmacher [32] (SBF): compresses the BF for transmission; used for web caches and for distributing routing tables in P2P networks.
4. Spectral Bloom Filter, Cohen and Matias [33] (CBF): represents a multi-set and supports frequency queries by increasing the smallest counter among the hashed positions; used for storing multisets.
5. Space-Code Bloom Filter, Kumar et al. [34] (SBF): supports frequency queries; used to measure per-flow traffic and for blind streaming.
6. Secure Bloom Filter, Goh E. J. [35] (SBF): secures indexes by applying a pseudo-random function twice to each element; privacy preserving applications.
7. Hierarchical Bloom Filter, Shanmugasundaram et al. [36] (SBF): hierarchical construction of BFs with a low false positive rate; substring matching and anomaly detection.
8. Bloomier Filter, Chazelle et al. [37] (SBF): encodes functions in a BF and answers membership queries over the stored function values.
9. Split Bloom Filter, Zhong et al. [38] (IBF): uses an s × n matrix, where s is a predefined constant based on the estimated cardinality of the data set and n is the number of BFs used; membership queries on dynamically increasing data sets.
10. Bloom Filter Banks, Chang et al. [39] (SBF): a pipeline of BFs in which the mapping of elements and sets determines which element of set S matches element X; routing and forwarding in networks.
11. Variable Length Signatures, Lu et al. [40] (ABF): represents elements by variable-length signatures driven by the flow of data; network flow management and element lookups.
12. d-left Counting Bloom Filter, Bonomi et al. [14] (CBF): counters maintained by dividing the filter into sub-tables, reporting fewer collisions; fingerprint matching.
13. Stable Bloom Filter, Deng et al. [17] (ABF): ensures that, over time, the expected number of zeros in the filter stays the same, so one can test whether an element was previously seen in a stream; duplicate detection in streaming data.
14. Retouched Bloom Filter, Donnet et al. [41] (SBF): random bit-clearing process that reduces false positives at the cost of some false negatives; used where information about large sets of nodes must be shared among route tracing monitors.
15. Weighted Bloom Filter, Bruck et al. [12] (SBF): allocates more bits to elements which are queried more frequently; suitable for Zipf-distributed data sets.
16. Distance-sensitive Bloom Filter, Kirsch and Mitzenmacher [11] (SBF): identifies closeness of an item to the members of S, implemented using LSH; used in binary classification.
17. Adaptive Bloom Filter, Bruck et al. [42] (CBF): uses adaptive counters and a varying number of hash functions; speed and space improvements in network and database applications.
18. Scalable Bloom Filter, Almeida et al. [16] (IBF): grows the BF according to the input while maintaining a tight upper bound on the false positive rate; suited to cases where the data volume is not known a priori, e.g., membership queries on streaming Big data.
19. Data Popularity-Conscious Bloom Filter, Zhong et al. [43] (SBF): uses more hash functions for popular elements and fewer for rarely queried ones, with off-line tuning; good results for data with skewed distributions.
20. Memory-Optimized Bloom Filter, Ahmadi and Wong [44] (SBF): uses a single hash function through random hashing; multidimensional packet classification based on tuple space search and applications where computational cost is critical.
21. Less-Hash Bloom Filter, Kirsch and Mitzenmacher [18] (SBF): uses double hashing and partition hashing to derive many hash values from few; cryptographic processes in networks.
22. Dynamic Bloom Filter, Guo et al. [15] (IBF): dynamically increases the size of the BF as the amount of incoming data increases, by adding filters of the same size as the original; dynamic sets in distributed environments and streaming data monitoring.
23. Layered Bloom Filter, Goel and Gupta [45] (CBF): a BF with multiple layers that tracks the number of times an element has been added by querying the layers starting from the deepest one; frequency queries on large Big data.
24. Deletable Bloom Filter, Rothenberg et al. [46] (SBF): probabilistic element removal by keeping a record of regions with high collisions, supporting deletion without false negatives; source routing to avoid loops, and middlebox services such as load balancers and firewalls.
25. Generalized Bloom Filter, Laufer et al. [47] (SBF): two sets of hash functions are used, one for setting bits and another for resetting them; used to reduce error in streaming applications where eviction of data is mandatory to make space in the filter.
26. Decaying Bloom Filter, Dautrich et al. [48] (ABF): works on a time-window protocol; duplicate element detection and hint-based routing in wireless sensor networks and data streams.
27. Bloom Matrix, Francesco Concas et al. [49] (SBF): extension of the standard Bloom filter that uses equal-sized Bloom filters for encoding multiple sets; load balancing, web caching and document search.
28. Learned Bloom Filter, Michael Mitzenmacher [50] (SBF): applies machine learning to model the data sets that the Bloom filter has to represent; useful when the query stream can be modelled as coming from a fixed distribution.
29. Adaptable Bloom Filter, A. Singh and S. Batra [51] (IBF): enhances the BF with a learning array in which a Kalman filter predicts the size of the Bloom filter for the next iteration, performing better than the Scalable Bloom filter on streaming data; peak hour analysis and server utilization in streaming data.
30. Bloom Vector, Francesco Concas et al. [49] (SBF): uses multiple BFs of variable length, which contributes significantly to reducing memory overhead; load balancing, web caching and document search.
31. Locality-Sensitive Bloom Filter, Y. Hua and X. Liu [52] (SBF): extension of the standard Bloom filter using bit-wise vectors and locality-sensitive hash functions; efficiently handles approximate membership testing on Big data with noise and uncertainty.
32. Dynamic Partition Bloom Filter, Sidharth Negi et al. [53] (IBF): dynamic structure based on a Bloom partition tree having standard BFs as leaves; informed routing and global collaboration in unstructured P2P networks.
33. Cascade Bloom Filter, N. Mousavi and M. Tripunitara [54] (IBF): a cascade of multiple Bloom filters in which the next BF differentiates the sets on which the previous BF failed; access enforcement in authentication where the acceptable error must be minimal.
34. Fuzzy-folded Bloom Filter, Singh et al. [27] (ABF): incorporates a fuzzy-enabled folding approach in which fuzzy operations accommodate the hashed data of one BF into another; Industrial Internet of Things (IIoT) environments.
35. ID Bloom Filter, Liu et al. [28] (SBF): maps each element to k particular positions in the filter and directly records the set ID at those positions; used in routers/switches for processing at network link speed.
36. Ultra-Fast Bloom Filter, Lu et al. [29] (SBF): makes use of Single Instruction Multiple Data (SIMD) techniques and outperforms other variants in membership query speed, with a low false positive rate; real-world network applications.
37. r-Dimensional Bloom Filter, Patgiri et al. [30] (IBF): exhibits high adaptability and scalability; suitable for large-scale membership queries.
38. Magic Cube Bloom Filter (MCBF), Sun et al. [31] (SBF): items belonging to the same set are stored in different Bloom filters, improving accuracy by redistributing items and improving query speed via spatial locality; membership queries over multiple sets.
2.2. Quotient Filter
The BF and its variants work efficiently only when the entire BF array resides in main memory. If the size of the BF exceeds the available RAM of the system, the number of operations required for fetching parts of the array and checking them increases the query time manifold; if this continues, the BF loses the purpose of its use. Another PDS, the Quotient Filter (QF) [55], can be used efficiently for approximate membership queries; it supports a multi-layer design, buffering and hash localization, which results in fast and efficient querying of elements even in secondary memory.
The QF is a space efficient and cache friendly data structure which uses the quotienting technique of hashing to store a set S efficiently [56]. It supports insertion, deletion, querying and merging operations. Its detailed working is shown in Fig. 5.
Each element x ∈ S is mapped by a primary hash function h to a string of p bits called the fingerprint of x, i.e., h : U → {0, ..., 2^p − 1}, fp(x) = h(x). The filter is an open hash table of m = 2^q buckets with (r + 3) bits per bucket, where r is the number of least significant bits of fp(x) stored as the remainder, q = (p − r) is the number of most significant bits used as the quotient, and three extra metadata bits are kept with each bucket. To insert the fingerprint of an element fp(x) into the QF, the remainder f_r ← fp(x) mod 2^r and the quotient f_q ← ⌊fp(x)/2^r⌋ are computed, where f_q denotes the index of the bucket to be used for insertion and f_r denotes the value to be inserted in bucket f_q. The main advantage of the QF over the BF is that fp(x) can be reconstructed from f_q and f_r:
fp(x) = f_q · 2^r + f_r    (8)
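The quotienting step of Eq. (8) is easy to state in code. The sketch below derives a p-bit fingerprint from an assumed SHA-1 hash, splits it into quotient and remainder, and checks that the fingerprint can be reconstructed; the parameters p and r and the hash choice are illustrative, and the metadata/probing machinery of a full QF is omitted.

    import hashlib

    P, R = 16, 8                # p-bit fingerprint, r remainder bits (so q = p - r quotient bits)

    def fingerprint(item):
        # p-bit fingerprint fp(x) derived from SHA-1 (an assumed primary hash h).
        return int(hashlib.sha1(str(item).encode()).hexdigest(), 16) % (1 << P)

    def quotient_remainder(fp):
        f_r = fp % (1 << R)     # remainder: r least significant bits
        f_q = fp >> R           # quotient: q most significant bits -> bucket index
        return f_q, f_r

    fp = fingerprint("10.0.0.1")
    f_q, f_r = quotient_remainder(fp)
    assert fp == f_q * (1 << R) + f_r    # Eq. (8): the fingerprint is reconstructible
    print(f_q, f_r)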
Definition 3: In a quotient filter, for two given fingerprints fp(x) and fp(y), if f_q(x) < f_q(y) then f_r(x) is always stored before f_r(y). The quotienting technique tries to generate a unique remainder and quotient
for every x_i ∈ S, although there is some chance of collision.
Figure 5: Quotient Filter [55]
On the basis of the remainder and quotient of fp(x), collisions are divided into two types:
Collisions in QF = { soft, if f_q(x) = f_q(y); hard, if f_q(x) = f_q(y) and f_r(x) = f_r(y) }    (9)
In case of soft collision (when fq of two items collide, but they have distinct
f_r), linear probing is used as the collision resolution strategy: the remainders of different fingerprints having the same f_q are stored contiguously, forming a run in the QF. If necessary, remainders associated with different f_q are shifted and the corresponding metadata bits are updated for each bucket. In a QF, a cluster refers to a sequence of one or more consecutive runs without any empty bucket in between; a cluster is immediately followed by an empty slot. The canonical slot for a fingerprint fp(x) is the original bucket for the insertion of f_r(x), indicated
by f_q(x). The terms run and cluster are used to identify the suitable position to insert or query an element after various shifts have been performed with the help of the metadata bits. False positives in the QF are encountered due to hard collisions (when both f_q and f_r of two items collide).
Table 2: Significance of the metadata bits in the QF

is_occupied | is_continuation | is_shifted | Significance
0 | 0 | 0 | Empty bucket.
0 | 0 | 1 | Bucket is holding the start of a run that has been shifted from its quotient (f_q) bucket (canonical slot).
0 | 1 | 0 | Not used.
0 | 1 | 1 | Bucket is holding the continuation of a run that has been shifted from its quotient (f_q) bucket (canonical slot).
1 | 0 | 0 | Bucket is holding the start of a run that is in its own (canonical) bucket.
1 | 0 | 1 | Bucket B_i is holding the start of a run that has been shifted from its quotient (f_q) bucket (canonical slot); B_i is also the canonical slot of some f_r whose remainder has been shifted right.
1 | 1 | 0 | Not used.
1 | 1 | 1 | Bucket B_i is holding the continuation of a run that has been shifted from its quotient (f_q) bucket (canonical slot); B_i is also the canonical slot of some f_r whose remainder has been shifted right.
Assuming that h(x) is distributed uniformly, the probability of a hard collision, Pr_HC, is given by:

Pr_HC = 1 − (1 − 1/2^p)^n ≈ 1 − e^{−n/2^p}    (10)
The metadata bits are used to find the location of elements which have been shifted from their canonical slot because of a soft collision; the f_r of such an element belongs to the run of a slot f_q but is stored at a different location. The most significant bit, referred to as the is_occupied bit, is set HIGH for the i-th bucket if the quotient of some fp(x), x ∈ S, satisfies f_q = i, i.e., the i-th bucket is the canonical slot for some element of the dataset. The middle bit, known as the is_continuation bit, helps the decoder during the search process to identify the group of items belonging to the same bucket. The least significant bit, named the is_shifted bit, is used to identify whether the f_r associated with the i-th bucket has been stored elsewhere. The significance of these bits is provided in Table 2.
fq denotes the index of bucket in which element needs to be inserted or
queried. In the insertion operation, the suitable position sp at which to insert the remainder is at the end of the run of the bucket denoted by f_q. For this, all elements after sp are shifted to the right; the same operation is repeated till the end of the cluster, and then the element is inserted and the metadata bits are updated. For a query operation in the QF (where the queried element is x_q), fp(x_q) is calculated and then the corresponding
quotient f_q(x_q) and remainder f_r(x_q) are computed. The start of the cluster contain-
ing f_q is identified, and then the start of the run corresponding to f_q is identified. In the querying process, instead of shifting elements, only the remainder of the queried element is checked in the concerned run. The deletion process is the reverse of the insertion operation.
The biggest advantage of the QF is that the original fingerprint, although hashed while storing, can be retrieved back through the quotienting technique.
The time required by the insertion and deletion processes can erode the advantage of using the QF, since a whole cluster is scanned in each operation. A Chernoff
bound can be used to limit the size of cluster.
Definition 4: For a QF with m slots in which α × m items are stored,

Pr[∃ a cluster of length ≥ k] < m^{−ε}    (11)

where ε is the allowable error and α ∈ [0, 1) is the load factor. The limit k on the cluster length (derived from the number of slots in the QF [55]) is given by:

k = (1 + ε) ln(m) / (α − 1 − ln α)    (12)
The length of the largest cluster can thus be controlled through the choice of m and the load factor α.
2.2.1. Advantages of QF over BF

• In the QF all operations are cache friendly; only a single cluster is modified per operation. Since the cluster size can be bounded (Eq. 11), a cluster fits easily into cache lines, and fewer fetch operations are required for bulk data stored in secondary memory. In a BF, fetching the bit for every hash function from secondary memory increases the complexity of the task.

• Since the QF supports in-order (linear) scans, results are obtained quickly compared to a BF constructed by adding new slices to an existing one, whose search complexity is comparatively high.

• Resizing of a QF is possible without rehashing all the stored data, which is not possible in a BF.

• Two QFs can easily be merged into a larger one, and false positives do not increase in this operation, whereas merging BFs may amplify the error.
• The QF performs the deletion operation accurately, whereas standard BFs do not allow deletion and the variants which support deletion may introduce false negatives.
Variants of the QF such as the Cascade Filter (CF) and the Buffered Quotient Filter (BQF) work on a similar principle and support operation on SSD memory [55].
2.2.2. Applications
The QF is widely used in network applications. Deep packet inspection (DPI) is a platform to monitor the incoming and outgoing traffic of a data centre. Identifying malicious users from the packets is a time consuming task; moreover, the matching process consumes a lot of memory and CPU resources. Al-Hisnawi
et al. used the QF to store the malicious users, keeping the search task fast and efficient as the size of the incoming data increases [57]. Dutta et al. proposed the Streaming Quotient Filter (SQF), a quotient filter based streaming model to
330
urn a
count duplicate entries in streaming data with a predefined fixed memory and a fast search facility [58]. The QF has been successfully applied to automatic term extraction from domain-specific corpora for fast results, and it has also been used in warehouse management to locate items efficiently [59]. Garg et al. have proposed an application of the Quotient Filter in VANETs, where it is utilized in an edge computing based security framework used for carrying out Big
data Analytics [60]. Quotient Filters have also been used for providing Quality
Jo
of Experience (QoE) in wireless content delivery networks (CDNs). They contribute significantly in improving the accuracy and reducing the effort involved in the caching process [61]. A new approach called Fast Two-dimensional filter with hash table (FTDF-HT) has been proposed for efficient name matching in
340
Named Data Networking (NDN). This approach is also based on the concept of Quotient Filter [62]. 25
Journal Pre-proof
3. Frequency Count
pro of
Given a set of duplicated values, one needs to estimate the frequency of each value. The estimation for relatively rare values can be imprecise, however, fre345
quent values and their absolute frequencies can be determined accurately. When frequency count needs to be solved in sub-linear space, some approximation in result is tolerable provided the processing is fast. In streaming data, frequent item counting is sometimes called - approximate frequent item counting which is defined in the next section.
350
Definition 5: (ε-approximate frequent item counting) Given a data stream S = {x1, x2, ..., xn} of n items, let F be the set of all xi ∈ S whose frequency exceeds a certain threshold, i.e., fi > (ϕ − ε) · n, where ϕ is a user-chosen threshold fraction, fi denotes the frequency of the i-th item, and ε is the allowable error in the results.
Solutions provided for approximate frequent item counting are divided into
two categories: counter based and sketch based. Counter based solutions use counters and a probabilistic counting mechanism in sub-linear space using fixed resources such as memory and computational time. The Frequent approximation al-
urn a
gorithm, Majority algorithm [63], Lossy-counting [64], Space-saving [65], etc., fall in this category. Sketches use hashing and approximation based algorithms to map a large data set into compact size s.t. size of sketch is much less then the size of dataset. Count-Min Sketch [66], Count Sketch [67], etc. fall in this category. A short survey of prevalent counting based algorithms is provided in Table III. 365
Among all the techniques mentioned in Table III, most robust, with less com-
Jo
putational cost, minimum memory requirement and most adaptive to answer frequency queries is a sketch data structure called Count Min Sketch (CMS). 3.1. Count-Min-Sketch CMS was proposed by Muthukrishnan and Cormode in 2003 and later im-
370
proved in 2005 [66]. It is one of the members in the family of memory efficient
26
Journal Pre-proof
PDS used to optimize the counting of the frequency of an element in lifetime
pro of
of a data set. It is a histogram in which one can store elements and associated counts. As compared to BFs, which represent sets, CMS considers multi-sets, i.e., instead of storing a single bit, CMS maintains a count of all objects. It 375
is called ‘sketch’ because it is a smaller summarization of a larger data set. The probabilistic component of CMS helps in achieving more accurate results in cardinality estimate as compared to counting BF which works with less space and time complexity. The counting BF works with one bloom filter of size m having maximum counter value MAX, while all k hash functions are updating in the same BF, which leads to more collisions and chances of more error in
re-
380
cardinality estimate using counting BF. In CMS combination of d BFs is used, in each row i.e. in each BF only one hash function is allowed to make changes and final decision of cardinality estimate is taken by considering all rows. All
385
lP
these optimizations in CMS help to reduce the deviation in cardinality estimate. Insertion process in CMS is similar to BF. Instead of using 1-D array, CMS uses 2-D array with w columns and d rows. These parameters are used to maintain the trade-off between space and time constraints and accuracy. Since one hash function Hi (.) is associated with each row i, d hash functions are
390
urn a
used for d rows . When an element x arrives, it is hashed to each row, i.e., ∀i(1
Jo
operations about insertion in CMS has been depicted in Fig. 6.
Figure 6: Insertion in Count Min Sketch
27
Journal Pre-proof
For the desired accuracy levels, two parameters (epsilon) and δ (delta) are
pro of
used to calculate w and d dimensions of a count min sketch.
(epsilon) is the measure of ‘error added to counts with each item added to 395
the CM sketch’. δ (delta) defines ‘with what probability one allow the count estimate to vary from error rate’.
Value of w and d are calculated as:
re-
e w=d e
1 d = dln( )e δ
(13)
(14)
where ln is natural log and 0 e0 is Euler’s constant. 400
To decrease the collision, pairwise independence is used for constructing a
lP
universal hash family.
CMS solves three type of data summarization problems [68]. First is point estimation where frequency of object a[i] in stream is estimated value of number of occurrences of a[i]; calculated by taking the minimum of all the respective counter values in CMS corresponding to that element. The basic insight here
urn a
405
is that there is possibility of collisions between elements, which may increment the counters for multiple items. Taking the minimum count results in a closer
410
approximation. Second is range sum where total frequency count of elements Pk lying in a defined range is returned, i.e., i=j a[i]. Third application of CMS
is to identify heavy hitters: given a stream of data arriving and a constant φ, it can find all items occurring more than φ × N times, i.e., ∀i, find a[i] > φ × N .
Jo
3.2. Count Min Sketch Analysis For an incoming data stream D and an element xi , actual frequency is
denoted by ai and a ˆi is estimated frequency of element by CMS, where , δ ∈
415
(0, 1) are accuracy parameter and confidence parameter respectively for a CMS
of w × d size [69]. 28
Journal Pre-proof
y_j, a random variable, gives the contribution of object j to the counter of object i when their hashed values collide although i ≠ j:

y_j = a_j if h(x_i) = h(x_j), and y_j = 0 otherwise.    (15)

The estimated frequency â_i is the sum of the actual count of object i (a constant value) and the counts of the objects j having hash collisions with object i; the expected value of â_i is obtained from:

â_i = a_i + Σ_{j≠i} y_j    (16)

E[â_i] = E[a_i + Σ_{j≠i} y_j]    (17)

E[â_i] = a_i + Σ_{j≠i} E[y_j]    (18)

Using the definition of y_j in Eq. (18), E[y_j] for a single row of w counters is given by:

E[y_j] = a_j · P[h(x_i) = h(x_j)] + 0 · P[h(x_i) ≠ h(x_j)]    (19)

With w counters in a row, the probability that objects i and j collide under a hash function is 1/w, so:

E[y_j] = a_j / w    (20)

Using the result of Eq. (20) in Eq. (18), we get:

E[â_i] ≤ a_i + Σ_{j≠i} a_j / w    (21)

E[â_i] ≤ a_i + ||a||_1 / w    (22)

where ||a||_1 is the L1 norm of a, ||a||_1 = Σ_{i=1}^{n} |a_i|. The higher the value of w, the better the accuracy of the CMS, and the more memory is required to achieve it.

Using the Markov inequality, for c > 0:

P[â_i − a_i ≥ c · ||a||_1 / w] ≤ 1/c    (23)

For a given value of ε, 0 < ε < 1, and w = ⌈e/ε⌉:

P[â_i > a_i + ε ||a||_1] ≤ 1/e    (24)

The CMS therefore uses O(1/ε) space per row, i.e., w ≈ e/ε counters, and estimates the frequency with error at most ε ||a||_1 with probability at least (1 − δ), where a single row gives δ = 1/e. The above analysis is for a w × 1 CMS, but one may have a CMS with multiple rows, i.e., w × d. P[Err] is the probability of error in a w × d CMS, where E_i denotes a damaging collision in the i-th row:

P[Err] = P[∃i : E_i occurs in every row] = 1 − P[some row avoids E_i]    (25)

Since the rows use independent hash functions, the estimates are independent of each other, and the minimum over rows fails only if every row fails, so:

P[Err] ≤ (1/e)^d    (26)
To obtain an error probability of δ, i.e., confidence (1 − δ), the required number of rows is:

d = ln(1/δ)    (27)
The final conclusion from the above analysis is that if d rows are maintained, then with probability at least (1 − δ) the estimate deviates from the true count by at most ε ||a||_1.
3.2.1. Applications
Consider a situation where a stream of data, e.g., updates to stock quotes in a financial processing system, arrives continuously, needs to be processed, and statistical queries have to be answered in real time. Efficient handling of such scenarios requires fast processing of the streaming data in a single pass. The CMS is quite useful for answering frequency queries in such problems, using small space with constant query time. The CMS has been successfully used in graph based semi-supervised learning for large
processing of streaming data in a single pass. CMS is quite useful in answering frequency query in such problems using small space with constant query time. CMS has been successfully used in graph base semi-supervised learning for large
30
Journal Pre-proof
scale data[70], finding Hot-IP and DDoS attacker in networks algorithms [71].
445
pro of
CMS-Tree, a data structure derived from CMS, has number of applications in Natural Language Processing [72]. Bonelli et al. have presented a counting framework based on probabilistic sketches and LogLog counters for estimating the cardinality of large multi-sets of data [73]. It can be efficiently used for online chain of processing of network devices running at multi-gigabit speeds. Zhu et al. have proposed an approach called Dynamic Count-Min sketch (DCM), 450
which is appropriate for dynamic data sets and can provide accurate estimates
for point queries and self-join size queries [74].
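Building on the illustrative CountMinSketch class sketched in Section 3.1 (a hypothetical helper, not a library API), the heavy-hitter use case behind applications such as Hot-IP detection can be approximated as follows; the threshold φ and the toy stream are assumptions for illustration.

    # Assumes the illustrative CountMinSketch class defined in Section 3.1.
    def heavy_hitters(stream, phi, eps=0.001, delta=0.01):
        cms = CountMinSketch(eps=eps, delta=delta)
        candidates, n = set(), 0
        for item in stream:
            cms.update(item)
            n += 1
            if cms.query(item) > phi * n:     # keep items whose estimate exceeds phi * N so far
                candidates.add(item)
        # Final filter against the full stream length N.
        return {x for x in candidates if cms.query(x) > phi * n}

    stream = ["10.0.0.1"] * 80 + ["10.0.0.%d" % i for i in range(2, 22)]
    print(heavy_hitters(stream, phi=0.5))     # {'10.0.0.1'}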
Table 3: Variants of counting algorithms

1. Majority Algorithm, J. S. Moore [75] (counter based): starts with a counter at zero; the first item seen is stored and its counter incremented; for a matching item the counter is incremented, otherwise it is decremented. Space O(1); update/query O(1); used to find the majority vote among n items.
2. Frequent Algorithm, Boyer and Moore [63] (counter based): a generalization of the majority algorithm that stores values in k counters only, incrementing the counter of an existing item and decrementing all counters when the item is not found. Space O(k); update/query O(1); reports the items whose frequency exceeds a 1/k fraction of the total count.
3. Lossy Counting, Manku and Motwani [64] (counter based): divides the data into buckets B_i and calculates the frequencies of the distinct elements; for a new bucket B_n, the counter of the previous bucket is used as a base, counts are maintained only for elements which cross a defined threshold in the bucket counters, and the extreme counters are pruned after each bucket. Space O(1/ε); update/query O(1); tracks all unique items in the data set and their corresponding counts.
4. Count Sketch, Charikar et al. [67] (sketch based): maintains a 2-D matrix similar to the Count-Min sketch, with an extra set of hash functions g_i(x) → {+1, −1} that decides whether the selected counter is incremented or decremented; the extra hash functions reduce the variance between the actual and the expected frequencies. Space O(log(n/δ)/min(ε², 1/k)); update/query O(log(n/δ)).
5. Count-Min Sketch, Cormode and Muthukrishnan [66] (sketch based): a w × d matrix in which one hash function per row updates the value of a new item; a query returns the minimum value over all rows as the frequency of the queried element. Answers all frequency queries on the Big data stored in the sketch, with a pre-calculated error bound given by the parameters ε and δ. Space O(log(n/δ)/ε); update/query O(log(n/δ)).
6. Space Saving, Metwally et al. [65] (counter based): stores only k items; the first k distinct items are stored while updating their counters, and when a new distinct item arrives, the item with the least counter value is replaced by it and its counter updated. Space O(k); update/query O(1); keeps a record of the items having frequency greater than ε × n using only k counters.
7. Hokusai, Matusevych et al. [76] (sketch based): an advanced, compact representation of the Count-Min sketch; when the counters of the CMS reach a defined threshold, the data is preserved in half of the space, with negligible loss of accuracy, by applying a fold operation. Folding allows merging of data by time and by item (time aggregation and item aggregation). The total space complexity is the same as for the CMS, but it accommodates more items in the same space; update time O(log(n/δ)), query time depends on the number of folds. Provides real-time statistics of arbitrary events, e.g., streams of queries as a function of time.
Journal Pre-proof
Journal Pre-proof
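To make the counter-based entries of Table 3 concrete, the following is a minimal, illustrative Python sketch of the Space Saving algorithm of Metwally et al. [65]; the function name and the example stream are ours, and the eviction rule follows the standard description in which the replacing item inherits the evicted minimum count plus one.

```python
def space_saving(stream, k):
    """Minimal sketch of the Space Saving algorithm: keep at most k counters;
    when a new distinct item arrives and all counters are occupied, replace the
    item with the smallest count and inherit that count (incremented by one)."""
    counters = {}                       # item -> estimated count
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            victim = min(counters, key=counters.get)   # least counter value
            count = counters.pop(victim)
            counters[item] = count + 1                 # over-estimates, never under-estimates
    return counters

if __name__ == "__main__":
    stream = list("aababcabcdabcde") * 10
    print(space_saving(stream, k=3))    # heavy hitters 'a' and 'b' dominate
```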
4. Cardinality Estimate
As the amount of data to be analysed increases, determining the cardinality becomes an important factor, especially when the incoming data is dynamic and the amount of data is unknown. In multi-sets, determining the exact cardinality is a highly computation-intensive process, since it is proportional to the number of elements in the large data sets.
Probabilistic cardinality estimators used for determining approximate cardinality include LogLog [77], HyperLogLog [78], MinCount [79], Probabilistic counting, etc. [80]; they are summarized in Table 4. In all these estimators, hash functions are used to ensure randomization, leading to a significant reduction in memory utilization; the cost paid is that an approximate output is obtained instead of the exact one. These probabilistic estimators are based on two approaches. First is Bit-pattern observables (BPO), where certain patterns of bits obtained after hashing are observed and conclusions are drawn based on these patterns; Probabilistic Counting, the LogLog counter, the HyperLogLog counter, etc. use the BPO principle for estimating the cardinality of a set. The second approach is statistics based, where Statistical Probability Methods (SPM) are used to find the cardinality; these include MinCount and the approximate counting algorithm.
One of the most frequently used cardinality estimators for massive Big data is the LogLog counter, a probabilistic counting based algorithm which uses a 16-bit hash function to randomize data and convert it into a uniform binary format [77]. The hashed data set obtained is used for the cardinality estimates. The estimator used in the LogLog counter is the geometric mean of all the registers, and the Bernoulli distribution is used to provide the final cardinality of the dataset. It estimates the cardinality in a single pass and within defined error limits, using memory much smaller than the size of the dataset. One major drawback of the LogLog counter is that it is not efficient for Big data containing outliers. HyperLogLog [78], an advanced version of the LogLog counter, uses the principle of stochastic averaging, employs a 32-bit or 64-bit hash function compared to the 16 bits of the LogLog counter, and considers the harmonic mean of all the registers to eliminate the effect of outliers.

4.1. Hyperloglog Counter
HyperLogLog (HLL) estimates the cardinality of a large set by using small memory with a fixed number of registers r of size m_r, where the size determines the capacity of a register to store the count and all parameters are a function of the expected approximation. It has been proved by Flajolet et al. that the HLL counter can count one billion distinct items with an error of 2% using only 1.5 KB of memory [78] (a proof of HLL accuracy is discussed in Appendix 1). In terms of functionality, HLL supports addition of elements and estimation of their cardinality, but it does not support membership checking of specific elements as done in BFs and QFs.
The HLL algorithm is based on the Bit-pattern observable principle, i.e., the cardinality of a uniformly distributed multi-set Z of numbers is estimated by calculating the maximum number of leading zeros in the binary representation of each number in Z. If the maximum number of leading zeros observed is n − 1, i.e., the pattern 0^{n−1}1, an estimate for the number of distinct elements in Z is 2^n. If a single counter is used for estimation, the variance in the result will be quite high. The solution proposed is to run the same experiment m times with different hash functions and then take the average; this reduces the variance and provides a better estimation [81]. HLL uses the principle of stochastic averaging, where the input stream is divided into r sub-streams; if the standard deviation for each sub-stream is σ, then the standard deviation of the averaged value is σ/√r. Also, the harmonic mean is used instead of the average to normalize the result and eliminate the effect of outliers. Such means have the property of taming probability distributions with slowly decaying right tails, operating as a variance reduction device and leading to better quality estimates.
The detailed working of HLL is shown in Fig. 7. Streaming data is managed in the input phase to compute the cardinality of the dataset. This data is further provided as input to the hashing phase, where each data instance is hashed into a binary string of l bits. From these l bits, the lower b bits are used to determine the register number to be updated, and the remaining (l − b) bits determine the value to be updated in the register. In the register block r, registers holding the maximum counter value m_r are maintained. The values of these registers are continuously updated according to the hashed values of the data instances. To provide the cardinality estimate after observing a certain amount of data, an estimator function is used, in which the harmonic mean of all register values is considered to reduce the variance in the cardinality estimation.

Figure 7: HyperLogLog framework
The input multi-set Z is divided into r sub-streams z_1, z_2, …, z_r, where r is the number of registers used to store the values, given by r ← 2^b with b ∈ Z and b > 0, and all registers in the set R are initially set to −∞. Only one hash function is used to convert the domain data to a binary stream for Bit-pattern observation, i.e., H(Z) : D → {0, 1}^∞. If s ∈ {0, 1}^∞ is a binary stream, then δ(s) is the function which returns the position of the leftmost 1 in s. For every z_i ∈ Z, hashing is done to convert the sub-stream into a binary string α (α ← H(z_i)). The first b bits of α, i.e., α_{1…b}, are used to determine the register r_i to be updated, and the remaining bits are used for the register value, i.e., r_i ← δ(α_{b+1,…}). The estimator function E of HLL is based on the harmonic mean:
E = β_r · r² · ( Σ_{j=1}^{r} 2^{−r_j} )^{−1}    (28)

where β_r is a constant based upon the size of the data set, used to correct the systematic multiplicative bias.
The algorithm makes adjustments for small and very large cardinality sets by adjusting the value of β_r. Every register r_i ∈ R uses at most log₂(log₂(n)) + O(1) bits when cardinalities less than or equal to n need to be estimated. The resulting relative error is 1.04/√r. The accuracy of the estimate is improved by increasing the number of registers in HLL [78].
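The following is a minimal, illustrative Python sketch of the procedure described above (a single hash function, r = 2^b registers indexed by the lower b bits, the rank of the leftmost 1 in the remaining bits, and the harmonic-mean estimator of Eq. (28)). The class and parameter names are ours, and the small- and large-range corrections of the full algorithm are omitted.

```python
import hashlib

class HyperLogLog:
    """Minimal HLL sketch: r = 2**b registers, harmonic-mean estimator."""
    def __init__(self, b=11):
        self.b = b
        self.r = 1 << b                       # number of registers
        self.registers = [0] * self.r
        # bias-correction constant (valid approximation for r >= 128)
        self.beta = 0.7213 / (1 + 1.079 / self.r)

    def _hash(self, item):
        h = hashlib.sha1(str(item).encode()).digest()
        return int.from_bytes(h[:8], "big")   # 64-bit hash value

    def add(self, item):
        x = self._hash(item)
        j = x & (self.r - 1)                  # lower b bits -> register index
        w = x >> self.b                       # remaining (64 - b) bits
        rank = (64 - self.b) - w.bit_length() + 1   # leftmost-1 position
        self.registers[j] = max(self.registers[j], rank)

    def cardinality(self):
        # E = beta_r * r^2 * (sum_j 2^(-R_j))^(-1), Eq. (28)
        z = sum(2.0 ** (-v) for v in self.registers)
        return self.beta * self.r * self.r / z

if __name__ == "__main__":
    hll = HyperLogLog(b=11)                   # 2048 registers, ~2.3% error
    for i in range(100000):
        hll.add(f"user-{i}")
    print(round(hll.cardinality()))           # close to 100000
```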
4.1.1. Applications
HLL finds usage in different application domains like natural language processing [82], biological data [83], large structured database mining [84], networks used for traffic monitoring [85], security related issues in networks like detection of worm propagation and detection of DoS (Denial of Service) attacks [86], data analytics [87], etc. A generalization, virtual HyperLogLog (vHLL), finds applications in network traffic measurement and database systems by utilizing compact memory [88]. HLL has also proved to be useful in computing fast and accurate genomic distances [89].
Table 4: Variants of cardinality estimate

1. Approximate Counting algorithm — Robert Morris [90] (SPM)
   Special features: Developed in Bell Labs on the basis of the theory that log₂ N bits need to be counted till N; uses probability based counters.
   Areas of application: Approximate and fast cardinality estimate (in one scan).

2. Adaptive sampling (Wegman sampling) — M. Wegman [91] (Sampling)
   Special features: Samples are taken uniformly over the domain of multisets and are independent from the distribution of data.
   Areas of application: Estimating cardinalities of multi-sets.

3. Probabilistic Counting — Flajolet et al. [80] (BPO)
   Special features: Hashing of the data is done in binary format and the first 1 from the leftmost side is considered as the observable. Observables are linked with cardinality, independent of replication and ordering of data in files.
   Areas of application: Data mining of Internet graphs.

4. Log-Log counter — Durand and Flajolet [77] (BPO)
   Special features: The observable bit used to set the register value high is selected from the right side.
   Areas of application: Estimating cardinalities of massive data sets.

5. Super Log Log counter — Durand and Flajolet [77] (BPO)
   Special features: The log-log counter is combined with a Truncation rule (while selecting register values) and a Restriction rule (use register values from a particular interval).
   Areas of application: Estimating cardinalities of massive data sets.

6. Min/Max Count — Fusy and Giroire [79] (BPO)
   Special features: Extended form of probabilistic counting which considers multiple hash functions instead of one and takes the Min or Max value from the hashed values according to the type of counting.
   Areas of application: Estimating cardinalities of massive data sets.

7. Hyper-Log-Log counter — Flajolet et al. [78] (BPO)
   Special features: A 32-bit hash function along with stochastic averaging is used, and the harmonic mean is introduced to eliminate the outliers. Handles larger Big data compared to the log-log counter, with low latency and a well defined error rate.
   Areas of application: Used by Google, Redis, Amazon, etc. for cardinality estimation [78].

8. Hyper-Log-Log++ counter — Heule et al. [92] (BPO)
   Special features: Improvement over HyperLogLog obtained by implementing 64-bit hash functions, leading to a sparse representation in the registers and a small-cardinality estimation regime.
   Areas of application: Implemented by Google in many applications [92].

SPM: Statistical Probability Methods, BPO: Bit-Patterns Observable
5. Similarity Search
Finding similar items in a set is the process of checking all items and identifying the closest one. To categorize a data set into a particular class, one needs to find how similar two items of the data set are to each other. Problems related to finding similar items are often solved by identifying the nearest neighbors of an object. Such problems have a number of mathematical solutions in terms of distance measures like Hamming distance, cosine and sine similarity measures, Jaccard's similarity coefficient, Pearson's similarity coefficient, etc. [93].
Searching a huge database for similar items using linear search or a brute force approach increases the computational cost drastically. Such solutions are efficient for small data sets, but when massive data sets are considered, they face two major problems: first, how to store and represent items for finding similarity in massive data sets; second, how to pairwise compare billions of items, especially in such high dimensional data sets [94].
The solution to the above stated problem is to either reduce the dimensionality of the data set or make structural assumptions about the data while maintaining its integrity, as done in data structures like trees and hashes. Trees show good results for low dimensional data, but as the dimensionality of the data increases, the query complexity and tree construction cost become too high. In hashes, data is mapped onto a hash table using random hash functions; items mapped close together in the hash table are assumed to be close neighbors, but the type of hash functions used and hashing collisions can significantly affect the results. Finding the nearest neighbor in Big data with n points and d dimensions using linear search has O(nd) complexity at best.
One of the known solutions proposed by researchers for finding the nearest neighbor in Big data with n points and d dimensions is the k-d tree. The k-d tree, or k-dimensional tree, proposed by Jon Bentley in 1975, is a binary tree where every node is a k-dimensional point and each level of the tree represents one dimension. Each level of a k-d tree splits all children along a specific dimension, using a hyperplane that is perpendicular to the corresponding axis. For an insertion or query operation the k-d tree needs recursive scanning of the tree, which is quite a time consuming task; in the worst case, a search is close to scanning the whole tree. The balancing operation in a k-d tree requires extra computational effort to generate a sorted k-d tree in multiple dimensions. With an increase in dimensions, new levels need to be added, increasing the size of the k-d tree exponentially.
Further, the nearest neighbour problem with approximation rules for a d-dimensional dataset is defined as follows:

Definition 6: (c-approximate NN problem [95]) Let P be a set of points in d dimensions and Q ⊂ ℝ^d a set of query points. For a query q ∈ Q, return a point p ∈ P such that d(p, q) ≤ c · d(p*, q), where p* is the true nearest neighbor of q in P and c ≥ 1 is the approximation factor.

Definition 7: ((R,c)-NN problem [95]) Let P be a set of points in d dimensions with a constant R > 0 and Q ⊂ ℝ^d a set of query points. For any query q ∈ Q the following decision related to closeness can be made:

• if ∃p′ ∈ P with d(p′, q) ≤ R, then return some point p ∈ P with d(p, q) ≤ c·R;

• if d(p, q) > c·R, ∀p ∈ P, then return nothing.
In the era of Big data, problems like similarity search on large data need reliable, fast and computationally efficient solutions. One of the solutions to the above mentioned similarity search problem is given by a hashing based sampling algorithm called Min-Hash (MH).
Min-Hash, a probabilistic technique given by Andrei Broder in 1997 [96], is used to find the similarity between two items by computing the Jaccard similarity J(A, B) between the items being considered. To find the similarity between members of a set S = {s_1, s_2, …, s_n}, Min-Hash uses a set of k hash functions H_k(S) → Z, and h_min stores the minimum value from the set of hash functions, h_min ← min(H_k(·)). Two elements s_1 and s_2 of the set S are considered similar if h_min(s_1) = h_min(s_2) [96]:

Pr[h_min(s_1) = h_min(s_2)] = |s_1 ∩ s_2| / |s_1 ∪ s_2| ≈ J(s_1, s_2)    (29)
If m_H is a random variable, the similarity of two items s_1 and s_2 is given by:

m_H = 1 if h(s_1) = h(s_2), and m_H = 0 otherwise    (30)
Here m_H ∈ {0, 1} is an unbiased estimator of the similarity. The high variance in m_H is reduced by averaging over a number of observations.
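A minimal Python sketch of this estimator, assuming a family of randomly seeded (non-cryptographic) hash functions: per Eqs. (29)-(30), the fraction of hash functions on which the minimum values of the two sets agree estimates their Jaccard similarity. Function names and parameters are illustrative.

```python
import random

def minhash_signature(items, num_hashes=128, seed=42):
    """Vector of h_min values for one set, using num_hashes randomly seeded
    hash functions (illustrative XOR-mask construction)."""
    rng = random.Random(seed)
    masks = [rng.getrandbits(64) for _ in range(num_hashes)]
    return [min(hash(x) ^ m for x in items) for m in masks]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of hash functions with h_min(A) == h_min(B); an unbiased
    estimate of |A ∩ B| / |A ∪ B| by Eq. (29)."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

if __name__ == "__main__":
    A = set("the quick brown fox jumps over the lazy dog".split())
    B = set("the quick brown fox sleeps under the lazy dog".split())
    exact = len(A & B) / len(A | B)
    est = estimate_jaccard(minhash_signature(A), minhash_signature(B))
    print(f"exact={exact:.3f} estimated={est:.3f}")
```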
A number of variants have been proposed to improve the performance of Min-Hash while maintaining its simplicity. Some important variants are:

k-min Hash [97]: Compared to Min-Hash, where a single hash function is used, k-min Hash uses k hash functions and the MIN and MAX values are found. For two sets A = {a_1, a_2, …, a_n} and B = {b_1, b_2, …, b_m}, the Jaccard similarity between them using k hash functions is given by:

J(A, B) = ∀_{i=1}^{MAX(n,m)}  MIN(h_{i=1}^{k}(a_i, b_i)) / MAX(h_{i=1}^{k}(a_i, b_i))    (31)
It is more accurate as compared to Min-Hash because it uses k hash functions instead of single hash function.
Min-Hash Sketch (MHS) [98]: The term 'sketch' indicates a summary of a large set. In a Min-hash sketch, k hash functions are calculated for each set and the k_1 (k_1 ⊂ k) hash functions with minimum values are stored in the MHS matrix. To compute similarity on a collection of sets, i.e., Ā ← (A_1, A_2, …, A_n), the MHS is constructed as follows:

MHS[i, j] ← ∀_{i=1}^{n}  h_{j=1}^{k} ( MIN_{j=1}^{k_1}(A_i) )    (32)
This sketch is used to compute the similarity of any pair of documents by comparing their associated minimum values.

Weighted Min Hash [99]: This technique is used for similarity search in textual data (application domains such as document retrieval, text mining, web search, etc.), where the entire text is divided into fixed size character sets using shingling (discussed below) or variable size pieces of text called tokens. In weighted Min-hash, different weights are assigned to the tokens generated from the text. A frequently used approach for weighting tokens in document retrieval is the Inverse Document Frequency (IDF). For a token t, the weight is computed as:

w(t) = log(1 + N/n_t)    (33)
where n_t is the number of documents, out of N, in which the token t appears. The weighting approach maintains equilibrium by putting small weights on frequent tokens and large weights on rare tokens. This unequal assignment of token weights decreases the value of common tokens and allows more informative tokens to stand out, leading to a significant improvement in the accuracy of the retrieved results. The Jaccard similarity for weighted Min-hash is given by:

J_W(A, B) = MIN(h_{i=1}^{k}(a_i, b_i)) · w(t)  /  MAX(h_{i=1}^{k}(a_i, b_i)) · w(t)    (34)
The weighted Jaccard similarity is a natural generalization of Jaccard similarity. It will become simple Jaccard similarity if all weights are set as 1. Variants of Min-hash are used for character level or lexical matching, not for contextual matching.
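As a small illustration of Eqs. (33)-(34), the sketch below computes IDF token weights and a weighted Jaccard similarity in its simple set form (weighted intersection over weighted union); the names and the toy corpus are ours and not part of the original variants.

```python
import math

def idf_weights(documents):
    """w(t) = log(1 + N / n_t), Eq. (33): rare tokens get large weights."""
    N = len(documents)
    df = {}
    for doc in documents:
        for t in set(doc):
            df[t] = df.get(t, 0) + 1
    return {t: math.log(1 + N / n_t) for t, n_t in df.items()}

def weighted_jaccard(a, b, w):
    """Weighted Jaccard between token sets a and b under weights w."""
    inter = sum(w[t] for t in a & b)
    union = sum(w[t] for t in a | b)
    return inter / union if union else 0.0

if __name__ == "__main__":
    docs = [set(d.split()) for d in
            ["big data sketch", "big data stream", "probabilistic sketch"]]
    w = idf_weights(docs)
    print(weighted_jaccard(docs[0], docs[1], w))
```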
For finding similar items in massive Big data, Min-hash represents the bulk data in a compressed form called the signature matrix, and Locality Sensitive Hashing (LSH) is used to shortlist and narrow down the pairwise comparisons by identifying the pairs of possibly similar items in the dataset.

5.1.1. Steps used in Min-hash
Shingling: A document is a string of characters. The most effective way to represent documents as sets is to convert them into small strings, for the purpose of identifying lexical similarity between documents. A k-shingle is a substring of length k found within the document. In shingling, two major issues are faced: first, how to pick the size of k, and second, which method to use to convert a document into shingles.
A shingle is a contiguous sub-sequence of tokens of length k (k can vary according to the application). If the value selected for k is too small, it is expected that most sequences of k characters will appear in most of the documents; such shingle sets lead to a high Jaccard similarity between unrelated documents. But if the value of k selected is very high, then matching of shingles with other documents has a very low probability, which again leads to erroneous results. Thus, k should be large enough that the probability of any given shingle appearing in any random document is low. The value of k is decided by:

c^k ≫ l    (35)

where c is the number of available characters and l denotes the average length of a document.
Considering the second issue, there are many methods to convert documents into shingles: remove all spaces from the document and then pick strings according to the shingle size; another approach considers the space character as well while calculating shingles; in another approach, all stop words are first removed from the document and then shingling is done; and in the hashing based approach, instead of using substrings directly as shingles, we can pick a hash function that maps strings of length k to some number of buckets and treat the resulting bucket number as the shingle. The set representing a document is then the set of integers that are bucket numbers of one or more k-shingles that appear in the document.
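A minimal sketch of character-level k-shingling and the hashed-bucket variant described above; the choice k = 5 and the bucket count are arbitrary illustrative values.

```python
def k_shingles(text, k=5):
    """All contiguous character substrings of length k (spaces removed)."""
    text = "".join(text.split())
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def hashed_shingles(text, k=5, buckets=2**32):
    """Represent each k-shingle by a bucket number instead of the string."""
    return {hash(s) % buckets for s in k_shingles(text, k)}

if __name__ == "__main__":
    d1 = "probabilistic data structures for big data"
    d2 = "probabilistic data structure for big data analytics"
    s1, s2 = k_shingles(d1), k_shingles(d2)
    print(len(s1 & s2) / len(s1 | s2))   # Jaccard similarity of shingle sets
```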
To perform a similarity search between items, pre-processing of the data is required to bring the data into a compressed form, for space saving, uniformity and fast results.
Characteristic Matrix (CM): The shingles computed for each document are hashed to compute the Jaccard similarity, and the matrix generated from them is called the characteristic matrix. For a set of documents S = {D_1, D_2, …, D_n}, each document having m shingles, the CM is defined as a binary (m × n) matrix. Rows of the CM represent the values of the shingles and columns represent the documents; CM[i, j] = 1 denotes that the i-th shingle is present in the j-th document.
Signature Matrix (SM): The (s × n) signature matrix (SM) is derived from the (m × n) CM and preserves the similarity structure of the entire set, where m ≫ s. To generate the SM, a hash function φ(·) is used which picks a row randomly from the CM, and the rows are then permuted across the columns to generate more random results. Repeating this process s times, an (s × n) signature matrix is generated. This process is followed since it is difficult to store the characteristic matrix and make pairwise comparisons for a huge number of entries. Min-Hashing is thus used for similarity preserving summarization of sets, i.e., a compact representation of a large data set in a smaller one with minimum loss of information. The SM generated by Min-Hash acts as the input for LSH.
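A minimal sketch of building an s × n signature matrix from the m × n characteristic matrix using explicit random row permutations, as described above; real Min-hash implementations replace the permutations with hash functions, and all names here are illustrative.

```python
import random

def characteristic_matrix(docs_shingles):
    """Binary m x n matrix: CM[i][j] = 1 iff shingle i occurs in document j."""
    universe = sorted(set().union(*docs_shingles))
    return universe, [[int(sh in d) for d in docs_shingles] for sh in universe]

def signature_matrix(cm, s, seed=7):
    """s x n MinHash signature matrix built from s random row permutations."""
    rng = random.Random(seed)
    m, n = len(cm), len(cm[0])
    sig = []
    for _ in range(s):
        perm = list(range(m))
        rng.shuffle(perm)
        # for each document, index of the first permuted row containing a 1
        sig.append([min(perm[i] for i in range(m) if cm[i][j]) for j in range(n)])
    return sig

if __name__ == "__main__":
    docs = [{"ab", "bc", "cd"}, {"ab", "bc", "de"}, {"xy", "yz"}]
    _, cm = characteristic_matrix(docs)
    sm = signature_matrix(cm, s=50)
    agree = sum(row[0] == row[1] for row in sm) / len(sm)
    print(agree)    # approximates the Jaccard similarity of documents 0 and 1
```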
Initially Min-hash was proposed for document similarity search engine Altavista [100] for grouping similar documents. Later, it was frequently used for similarity search and document duplicate detection especially in web pages
lP
650
[93, 96]. Apart from documents, Min-hash is successfully used in different areas such as-comparing and calculating the distance between genome and meta genomes [101], in domain of image clustering to cluster near duplicates [102],
655
urn a
in clustering of graphs of large data bases like social network [103], in network security Min-hash based sequence classification models are used to detect malwares [104]. In Software Defined Networks (SDN), it is used to build a malicious code classification system [105]. It has also been used in hybridization with HLL to improve the performance of HLL. [78]. Min-hash along with LSH is used in many applications domains to reduce high dimension similarity search to low 660
dimension similarity search [106].
Jo
5.2. Locality-Sensitive Hashing (LSH) The basic principle used in LSH is projection of higher dimensional data in
low dimensions subspace, using the fact that points close in many dimensions remain close in two dimensions too.
665
Let xi ∈
Journal Pre-proof
be a family of hash functions, mapping
pro of
For any two points xi , xj ∈
670
then P1 and P2 are the probabilities that xi and xj will reside in same bucket. The family of H is called locality sensitive or (d1 , d2 , P1 , P2 )− sensitive [107].
Definition 8: A family H of hash functions is said to be (d1 , d2 , P1 , P2 )-sensitive [95] if:
• D||xi , xj || ≤ d1 then P rH [h(xi ) = h(xj )] ≥ P1
re-
675
• D||xi , xj || ≥ d2 then P rH [h(xi ) = h(xj )] ≤ P2
for all cases d1 < d2 and all queries satisfy P1 > P2 , here D||xi , xj || denotes the
lP
distance between two points. If xi and xj are close in
Jo
urn a
680
Figure 8: Probability v/s distance measure in locality sensitive hashing [93]
LSH is used to solve (R,c)-Nearest Neighbour (NN) problem. (R,c)-NN prob-
lem is decision version of c-approximate NN problem. For (R,c)-NN problem in Locality Hash function r1 = R, r2 = cR, where c > 0. 47
re-
pro of
Journal Pre-proof
Figure 9: Locality-Sensitive Hashing framework
Definition 9: (Distance Measures [108])
lP
685
dimensions. Distance measure function on
690
urn a
• D|x, y| = 0 if and only if (x = y)
• D|x, y| = D|y, x|, i.e., distance is symmetric. • D|x, y| ≤ D|x, z| + D|z, y|, triangle inequality or length of shortest path rule.
Number of variants of LSH have been proposed depending upon the universe on which original data is mapped, i.e., on the basis of distance coefficients which satisfies Definition 9 because every distance measure may not have a
Jo
695
corresponding LSH family. Depending on the random function chosen and its locality sensitive properties, LSH is divided into various categories which are discussed in Appendix 2. The key idea of the LSH approximate nearest neighbor (NN) algorithm is to
700
construct a set of hash functions such that the probability of nearby points being 48
Journal Pre-proof
close after transformation with the hash function is larger than the probability
pro of
of two distant points being close after the same transformation. The range space of the function is discretized into buckets and we say that there is a ‘collision’ when two points end up in the same bucket. 705
LSH works by using a carefully selected hash function that causes objects or documents which are similar to have a high probability of colliding in a hash bucket. LSH consists of three phases: pre processing where data is mapped using different distance measures, hash generation where the hash tables are constructed, and similarity search, where the hash tables are used to identify similar items. Entire data is placed in n buckets such that similar items are
re-
710
placed in same bucket. The detailed operations of LSH are illustrated in Fig. 9. 5.2.1. Applications:
Some of the applications which require identification of similar items in-
715
lP
clude similarity in ranking of a product by two users in recommender systems, finding near duplicates corresponding to a particular query document in web documents, identifying similar type of truncations in databases, etc. In recent years, LSH has been used for many applications which require fast computa-
urn a
tional process [109, 110]; in pattern matching which include video identification [111]. Latest area of LSH applications use modified hashing techniques for faster 720
computational process. In mobile services, LSH is used for detecting clones in Android applications [112]. Bertine et al. [113] have used LSH for assembling large genomes with single-molecule sequence in bio-informatics. Naderi et al. have proposed the usage of Locality Sensitive Hashing (LSH) for Malware Signature Generation. It clusters various malicious programs to reduce the number of signatures significantly [114]. With the tremendous increase in sharing of on-
Jo
725
line video data, LSH has also been applied for deduplication of videos by Li et al. [115].
6. Discussion
In today's world, data originates from heterogeneous sources, and current real world databases are severely susceptible to inconsistent, incomplete and noisy data [93]. In order to support data applications in different domains, data processing must be as efficient and automated as possible. With an exponential increase in data, extraction of useful information from massive data, particularly for analytics, is a daunting task [1]. Some of the applications which need special attention include heavy hitters in data streams, frequency queries for all items in a set, estimating the cardinality of a massive dataset, finding similar items in a huge pool of items, membership queries, etc. The main challenge is to store massive data in memory and then index all items for future reference. While dealing with Big data, especially when the incoming data is continuous, the algorithm needs to answer the query in one pass only.
This paper discusses various application areas where probabilistic data structures help in reducing the space and time complexity to a great extent, especially for massive data sets.
Various variants of PDS have been discussed and explained. Because of its simplicity in design and adaptive nature, the BF has been successfully used in a large number of application domains. The variants discussed in Table 1 explain the modifications proposed in the BF which make it a successful candidate for applications in different domains. It has been observed that recently the focus has shifted to BFs which deal with streaming data, i.e., those belonging to the dynamic or ageing BF category.
QF is another cache friendly PDS used for membership queries. QF has major advantages over BF in terms of memory and computational time. Use of QF is beneficial when fast insertion and querying are required for data stored in secondary memory. Further, merging two QFs without any change in accuracy is an added advantage.
The Count-min sketch is motivated by the counting BF concept to reduce error in observations, where a number of BFs are used in parallel and the minimum values from the counters of the CMS are taken as the final output. CMS is the most optimal option in the counting algorithm group for problems like frequency queries, heavy hitters, top-k queries, etc. While working on skewed distributed data, CMS faces some problems like inefficiency in tracking heavy hitters, space wastage by not using all counters, etc., and such issues need a thorough consideration.
HyperLogLog and HyperLogLog++ counters are used to determine the cardinality of a huge data set based on the bit-pattern observable principle and the use of stochastic averaging and the harmonic mean. The major advantage of HLL is that very small memory is used and the error rate is significantly low; it can be reduced further by using hash functions with more bits.
Locality sensitive hashing helps to solve the approximate or exact Near Neighbor Search problem in high dimensional spaces in sub-linear search time. Initially, all the data is mapped to a low dimensional space and then hash based similarity measures are used to find the closest cluster for the queried item. There are a few issues in LSH which need improvement. Preprocessing in LSH amplifies the error rate, increasing the computational overhead many fold. Dynamic changes in the data sets are difficult to incorporate in LSH, as they lead to the computational overhead of redoing all the preprocessing work. Since the hashing used in LSH is independent of the nature of the data, hashing bias may be observed in some cases.
Table 5 summarises the important features of all the PDS covered in this paper.
Table 5: Comparative analysis of all PDS studied. The parameters compared are Hashing, Querying, Deletion, Merging, Retrieval of original data set, False Positives, False Negatives, Element Count, Similarity search, Cardinality Estimate, Time Complexity, Space Complexity and Computational Cost, for the Bloom Filter, Quotient Filter, Count Min Sketch, Hyper Log Log and Locality Sensitive Hashing. (A "?" entry indicates that some variants of the PDS may support the particular feature.)
7. Conclusion and Future scope
This paper provides a comprehensive view of various prevalent PDS which can be used for the storage, retrieval and mining of massive data sets. The data structures discussed in the paper can be used to store bulk data in minimum space, find the cardinality of data sets, identify similar data sets in Big data and find the frequency of elements in massive data. All the PDS are supported with their mathematical proofs, i.e., the mathematical analysis of BF and QF is provided in Sections 2.1 and 2.2 respectively, the analysis of CMS is provided in Section 3.1, Section 4.1 and Appendix 1 explain HLL, and Appendix 2 discusses the details of LSH. Application areas have been discussed at the end of every PDS in the entire manuscript, indicating the domains where they have been successfully implemented. It has been experimentally proved that the complexity of PDS is far better than that of the deterministic ones for various operations such as insert, delete, traversal and search, along with other statistical queries. Recent developments in PDS and the aptness of the Smart City realm for IoT and Big Data applications have opened a plethora of research opportunities for industry and academia. In the present era of IoT, where sensors, social media, etc. are sending petabytes of data per minute, the major challenge is to provide generic platforms for in-stream data analytics, especially when the volume, variety and velocity of the data are not known a priori. Although PDS have been proposed in the literature for massive data handling in IoT and Smart Cities, there are still many domains which need to be catered for, including smart health and smart vehicular network management (smart parking, smart environment management, etc.). Named Data Networking (NDN) is another domain where PDS can be used in Forwarding Information Base (FIB) lookup to enhance the routing scheme and in the Pending Interest Table (PIT) for duplicate detection. PDS can be used in Bioinformatics, since storing and pattern matching of k-mers and DNA sequences can be done efficiently through LSH in combination with the Bloom Filter. A few researchers are focusing on efficient utilization of PDS in cryptocurrency and privacy preservation, especially in location aware applications, and this needs further exploration.
Table A.6: PDS Implementation Resources

1. BF — https://github.com/jaybaird/python-bloomfilter — pybloom includes a Scalable Bloom Filter implementation.
2. BF — https://github.com/seomoz/pyreBloom — pyreBloom provides a Redis backed Bloom Filter using GETBIT and SETBIT.
3. QF — https://github.com/vedantk/quotient-filter — Quotient filter in-memory implementation written in C.
4. QF — https://github.com/bucaojit/QuotientFilter — Quotient filter implementation in Java.
5. CMS — https://github.com/rafacarrascosa/countminsketch — CountMinSketch is a minimalistic Count-min Sketch in pure Python.
6. HLL — https://github.com/prasanthj/hyperloglog — API support for specifying the hashcode directly to compute the cardinality estimate.
7. LSH — https://github.com/ekzhu/datasketch — Datasketch gives you probabilistic data structures that can process very large amounts of data.
8. LSH — https://github.com/simonemainardi/LSHash — LSHash is a fast Python implementation of locality sensitive hashing with persistence support.
9. LSH — https://github.com/go2starr/lshhdc — LSHHDC: Locality-Sensitive Hashing based High Dimensional Clustering.
The variants of the PDS along with the applications can serve as an initial benchmark for readers who want to pursue their research in this area. Considering the exponential increase in data and the application domains, it can be concluded that PDS can be used for large scale applications in various engineering fields.
APPENDIX
Appendix A. Implementation Resources In table A.6, some useful resources are described related to each PDS we have discussed so far.
Appendix B. (Hyper Log Log) Theorem 1: Let the algorithm HYPERLOGLOG be applied to an ideal multi-
Jo 820
set of (unknown) cardinality n, using m ≥ 3 registers, and let E be the resulting cardinality estimate. Proof: Here is the intuition underlying the algorithm. Let n be the unknown n cardinality of M. Each substream will comprise approximately ( m ) elements.
54
Journal Pre-proof
825
n Then, its Max-parameter should be close to log2 ( m ). The harmonic mean of n m.
An ideal multiset
pro of
the quantities 2M ax is then likely to be of the order of
of cardinality n is a sequence obtained by arbitrary replications and permutations applied to n uniform identically distributed random variables over the real interval [0:1]. 830
Note that the number of distinct elements of such an ideal multiset equals n with probability 1. Henceforth let Eˆn and Vˆn be the expectation and variance operators under this model.
re-
• The estimate E is asymptotically almost unbiased in the sense that 1 ˆ En (E)n→∞ = 1 + δ1 (n) + O(1) n where |δ1 (n)| < 5 × 10−5 as soon as m ≥ 16
835
q
Vˆn (E)
βm = √ + δ2 (n) + O(1) m where |δ2 (n)| < 5 × 10−4
n→∞
urn a
1 n
1 n
(B.2) (B.3)
q Vˆn (E), where n → ∞
lP
• The standard error defined as
(B.1)
as soon as m ≥ 16
(B.4) (B.5) (B.6)
the constants βm being bounded, with β16 = 1.106, β32 = 1.070, β64 = 1.054,
terms the typical error to be observed (in a mean quadratic sense). The func-
Jo
840
β128 = 1.046, and p β∞ = (3log2) − 1 = 1.03896. The standard error measures in relative
tions δ1 (n); δ2 (n) represent oscillating functions of a tiny amplitude, which
are computable, and whose effect could in theory be at least partly compensated—they can anyhow be safely neglected for all practical purposes. From Theorem 1 main conclusions to the effect that the relative accuracy of
845
hyperloglog is numerically close to
β∞ m .
55
The algorithm needs to maintain a
Journal Pre-proof
collection of registers, each of which is at most log2 log2 (N ) + O(1) bits, when
pro of
cardinalities ≤ N need to be estimated. As a consequence, using m = 2048,
hashing on 32 bits, cardinalities till values over N = 109 can be estimated with a typical accuracy of 2% using 1.5kB of storage.
850
Appendix C. (LSH Families)
Appendix C.1. LSH with Hamming distance
LSH on binary string was proposed by Indyk and Motwani [116], where a data set
re-
p, q ∈
lP
represented as binary string or U nary(p) function is used for replacing each coordinate of pi with equivalent binary string of γ bits. Hash functions for hamming space are constructed by selecting k bits randomly from the binary string b(.) of an element. ` hash functions are calculated,
urn a
given by:
∀`i=1 Hi ← Randomk (b(.))
860
Each hash function returns k random bits from the binary string b of an element of γ bits. For two items (p and q)∈
Jo
P r[hi (p) = hi (q)] = γ − ||p − q||h1 = (1 −
865
P r[hi (p) 6= hi (q)] = γ − ||p − q||h2 = (1 −
||p−q||h1 γ ||p−q||h2 γ
) )
This problem can be converted into (r1 , r2 , P1 , P2 ) problem. LSH with hamming ||p−q||h1 ||p−q||h2 distance is ||p − q||h1 , ||p − q||h2 , (1 − ), (1 − ) − sensitive. To γ γ
find (R,c)-nearest neighbor to the query element z, all hash functions are com-
puted for z, i.e.,
` i=1 hi (z)
←`i=1 Hi (z). Instead of comparing each binary string 56
Journal Pre-proof
870
with other, only hash values are compared and element having most similar
pro of
hash values are considered as nearest neighbor. Appendix C.2. LSH with Jacard Similarity
LSH is jaccard similarity [94] is computed with the help of Min-hash. The Signature Matrix(SM) act as a input. If m1 and m2 are the Min-hash based 875
distance between two rows of signature matrix then probability that both rows will appear in same band after applying banding technique will be (1 − m1 ) and (1 − m2 ) corresponding. After banding techniques similar items are grouped in same buckets with very high probability. According to definition of LSH family,
880
re-
this is a (m1 , m2 , (1 − m1 ), (1 − m2 ))− sensitive LSH family. Appendix C.2.1. LSH with Euclidean Distance(Edx,y )
LSH on points distributed in d-dimensional
lP
el. [117]. Let ai ∈
Edai ,aj
v u d uX a xal i − xl j =t l=1
urn a
For applying LSH on given points in ls space, first hash function (hi ) is generated by a random line (Li ) in a given plane. Li is divided into buckets of equal size (s) and orthogonal projection from the given points to the line is drawn. If points are close enough they lie in same bucket on the line Li . Let 885
ai , aj ∈
s 2
then there exists a probability that two points are in the
Jo
same bucket, i.e., P1 =
1 2
and if Edai ,aj ≥ 2s then probability that both points
resides in same bucket is dependent on angle between them, if θ lies between
890
60 < θ < 90, probability P2 = 31 . Hence, the family of points in d-dimensional space
orthogonal projection on a random line with intervals of size s is a ( 2s , 2s, 12 , 13 )sensitive family of hash functions. 57
Journal Pre-proof
Appendix C.3. LSH with Cosine Distance(θ)
pro of
LSH on vectors in d-dimensional
as a distance measure [118]. Let vi ∈
vi .vj |vi ||vj |
Let vi , vj ∈
re-
a hyperplane which is selected randomly so that angle between rvg and vi , vj varies, so dot product DPi = rvg .vi and DPj rvg .vj varies. If hyperplane is ran900
dom such that rvg lies between vi and vj then DPi and DPj have same signs
lP
other wise they have different signs.
P r(Sign(rvg .vi ) = Sign(rvg .vj )) =
θ 180
With cosine distance θ1 and θ2 , the pair of vectors in d-dimensional space vi , vj ∈
θ1 180 ), (1
−
θ2 180 )-
905
urn a
sensitive family of hash functions.
Here only four LSH families, which are normally found in the literature, have been discussed but they can always be further extended.
References
[1] S. Garc´ıa, S. Ram´ırez-Gallego, J. Luengo, J. M. Ben´ıtez, and F. Herrera, “Big data preprocessing: methods and prospects,” Big Data Analytics, vol. 1, no. 1, p. 9, 2016.
Jo 910
[2] L. Rutkowski, M. Jaworski, and P. Duda, “Basic concepts of data stream mining,” in Stream Data Mining: Algorithms and Their Probabilistic Properties.
Springer, 2020, pp. 13–33.
58
Journal Pre-proof
[3] C. Srinivasan, B. Rajesh, P. Saikalyan, K. Premsagar, and E. S. Yadav, “A review on the different types of internet of things (iot),” Journal of
pro of
915
Advanced Research in Dynamical and Control Systems, vol. 11, no. 1, pp. 154–158, 2019.
[4] M. P. Singh, M. A. Hoque, and S. Tarkoma, “Analysis of systems to process massive data stream,” CoRR, 2016. 920
[5] J. Bi and C. Zhang, “An empirical comparison on state-of-the-art multiclass imbalance learning algorithms and a new diversified ensemble learn-
re-
ing scheme,” Knowledge-Based Systems, vol. 158, pp. 81–93, 2018.
[6] W. Gan, J. C.-W. Lin, H.-C. Chao, H. Fujita, and P. S. Yu, “Correlated utility-based pattern mining,” Information Sciences, vol. 504, pp. 470 925
– 486, 2019. [Online]. Available: http://www.sciencedirect.com/science/
lP
article/pii/S0020025519306139
[7] A. Gakhov, Probabilistic Data Structures and Algorithms for Big Data Applications. [8] I. Katsov,
“Probabilistic data structures for web analytics and
data
mining,”
line].
Available:
2012,
urn a
930
BoD–Books on Demand, 2019.
[Accessed
Online:
May
2016].
[On-
https://highlyscalable.wordpress.com/2012/05/01/
probabilistic-structures-web-analytics-data-mining/
[9] B. H. Bloom, “Space/time trade-offs in hash coding with allowable errors,” Commun. ACM, vol. 13, no. 7, pp. 422–426, 1970.
935
[10] S. Tarkoma, C. E. Rothenberg, and E. Lagerspetz, “Theory and practice
Jo
of bloom filters for distributed systems,” IEEE Communications Surveys Tutorials, vol. 14, no. 1, pp. 131–155, 2012.
[11] A. Kirsch and M. Mitzenmacher, “Distance-sensitive bloom filters.” in Proceedings of the Meeting on Algorithm Engineering & Expermiments,
940
vol. 6.
Philadelphia, PA, USA: SIAM, 2006, pp. 41–50.
59
Journal Pre-proof
[12] J. Bruck, J. Gao, and A. Jiang, “Weighted bloom filter,” in IEEE InterIEEE, 2006.
pro of
national Symposium on Information Theory.
[13] L. Fan, P. Cao, J. Almeida, and A. Z. Broder, “Summary cache: A scalable wide-area web cache sharing protocol,” IEEE/ACM Trans. Netw., vol. 8, 945
no. 3, pp. 281–293, Jun. 2000.
[14] F. Bonomi, M. Mitzenmacher, R. Panigrahy, S. Singh, and G. Varghese, “An improved construction for counting bloom filters,” in Proceedings of the 14th Conference on Annual European Symposium, ser. ESA’06, vol. 14.
950
re-
London, UK: Springer-Verlag, 2006, pp. 684–695.
[15] D. Guo, J. Wu, H. Chen, Y. Yuan, and X. Luo, “The dynamic bloom filters,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 1, pp. 120–133, 2010.
lP
[16] P. S. Almeida, C. Baquero, N. Pregui¸ca, and D. Hutchison, “Scalable bloom filters,” Inf. Process. Lett., vol. 101, no. 6, pp. 255–261, Mar. 2007. 955
[17] F. Deng and D. Rafiei, “Approximately detecting duplicates for streaming data using stable bloom filters,” in Proceedings of the ACM SIGMOD
urn a
International Conference on Management of Data, ser. SIGMOD’06. New York, USA: ACM, 2006, pp. 25–36.
[18] A. Kirsch and M. Mitzenmacher, “Less hashing, same performance: Build960
ing a better bloom filter,” Random Struct. Algorithms, vol. 33, no. 2, pp.
187–218, Sep. 2008.
[19] S. Geravand and M. Ahmadi, “Bloom filter applications in network secu-
Jo
rity: A state-of-the-art survey,” Computer Networks, vol. 57, no. 18, pp. 4047–4064, 2013.
965
[20] K. W. Choi, D. T. Wiriaatmadja, and E. Hossain, “Discovering mobile applications in cellular device-to-device communications: Hash function and bloom filter-based approach,” IEEE Transactions on Mobile Computing, vol. 15, no. 2, pp. 336–349, 2016. 60
Journal Pre-proof
[21] K. Verma and H. Hasbullah, “Bloom-filter based ip-chock detection scheme for denial of service attacks in vanet,” Security and Communi-
pro of
970
cation Networks, vol. 8, no. 5, pp. 864–878, 2015.
[22] W. Song, B. Wang, Q. Wang, Z. Peng, W. Lou, and Y. Cui, “A privacypreserved full-text retrieval algorithm over encrypted data for cloud storage applications,” Journal of Parallel and Distributed Computing, vol. 99, 975
pp. 14 – 27, 2017.
[23] B. Groza and P.-S. Murvay, “Efficient intrusion detection with bloom fil-
re-
tering in controller area networks,” IEEE Transactions on Information Forensics and Security, vol. 14, no. 4, pp. 1037–1051, 2019. [24] K. Cheng, “Hot spot tracking by time-decaying bloom filters and reser980
voir sampling,” in International Conference on Advanced Information NetSpringer, 2019, pp. 1147–1156.
lP
working and Applications.
[25] M. Najam, R. U. Rasool, H. F. Ahmad, U. Ashraf, and A. W. Malik, “Pattern matching for dna sequencing data using multiple bloom filters,” BioMed Research International, vol. 2019, 2019. [26] Quora, “What are the best applications of bloom filters?”
urn a
985
2014,
[Accessed Online: Feb 2017]. [Online]. Available: https://www.quora. com/What-are-the-best-applications-of-Bloom-filters
[27] A. Singh, S. Garg, K. Kaur, S. Batra, N. Kumar, and K.-K. R. Choo, “Fuzzy-folded bloom filter-as-a-service for big data storage on cloud,” IEEE Transactions on Industrial Informatics, 2018.
[28] P. Liu, H. Wang, S. Gao, T. Yang, L. Zou, L. Uden, and X. Li, “Id
Jo
990
bloom filter: Achieving faster multi-set membership query in network applications,” in 2018 IEEE International Conference on Communications (ICC).
IEEE, 2018, pp. 1–6.
61
Journal Pre-proof
995
[29] J. Lu, Y. Wan, Y. Li, C. Zhang, H. Dai, Y. Wang, G. Zhang, and B. Liu,
pro of
“Ultra-fast bloom filters using simd techniques,” IEEE Transactions on Parallel and Distributed Systems, vol. 30, no. 4, pp. 953–964, 2019.
[30] R. Patgiri, S. Nayak, and S. K. Borgohain, “rdbf: A r-dimensional bloom filter for massive scale membership query,” Journal of Network and Com1000
puter Applications, 2019.
[31] Z. Sun, S. Gao, B. Liu, Y. Wang, T. Yang, and B. Cui, “Magic cube bloom filter: Answering membership queries for multiple sets,” in 2019 IEEE
IEEE, 2019, pp. 1–8. 1005
re-
International Conference on Big Data and Smart Computing (BigComp).
[32] M. Mitzenmacher, “Compressed bloom filters,” IEEE/ACM Transactions
lP
on Networking, vol. 10, no. 5, pp. 604–612, 2002. [33] S. Cohen and Y. Matias, “Spectral bloom filters,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, ser. SIGMOD’03. 1010
New York, USA: ACM, 2003, pp. 241–252.
[34] A. Kumar, J. J. Xu, L. Li, and J. Wang, “Space-code bloom filter for
urn a
efficient traffic flow measurement,” in Proceedings of the 3rd ACM SIGCOMM Conference on Internet Measurement, ser. IMC’03.
New York,
USA: ACM, 2003, pp. 167–172.
[35] E.-J. Goh, “Secure indexes.” IACR Cryptology ePrint Archive, vol. 2003, 1015
pp. 2–16, 2003.
[36] K. Shanmugasundaram, H. Br¨ onnimann, and N. Memon, “Payload attri-
Jo
bution via hierarchical bloom filters,” in Proceedings of the 11th ACM Conference on Computer and Communications Security, ser. CCS’04.
New York, USA: ACM, 2004, pp. 31–41.
1020
[37] B. Chazelle, J. Kilian, R. Rubinfeld, and A. Tal, “The bloomier filter: An efficient data structure for static support lookup tables,” in Proceedings
62
Journal Pre-proof
of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms,
pro of
ser. SODA’04. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2004, pp. 30–39. 1025
[38] M.-Z. Xiao, Y.-F. Dai, and X.-M. Li, “Split bloom filter,” Tien Tzu Hsueh Pao/Acta Electronica Sinica, vol. 32, pp. 241–245, 2004.
[39] F. Chang, W. chang Feng, and K. Li, “Approximate caches for packet classification,” in Twenty-third AnnualJoint Conference of the IEEE Computer and Communications Societies (INFOCOM’04), vol. 4, March 2004, pp. 2196–2207.
re-
1030
[40] Y. Lu, B. Prabhakar, and F. Bonomi, “Bloom filters: Design innovations and novel applications,” in In Proc. of the Forty-Third Annual Allerton
lP
Conference, 2005.
[41] B. Donnet, B. Baynat, and T. Friedman, “Retouched bloom filters: Al1035
lowing networked applications to trade off selected false positives against false negatives,” in Proceedings of the ACM CoNEXT Conference, ser. CoNEXT’06.
J. Gao,
and A. A. Jiang,
urn a
[42] J. Bruck,
New York, USA: ACM, 2006, pp. 13:1–13:12. “Adaptive bloom filter,”
California Institute of Technology, 2006. [Online]. Available:
1040
http:
//authors.library.caltech.edu/26103/1/etr072.pdf
[43] M. Zhong, P. Lu, K. Shen, and J. Seiferas, “Optimizing data popularity conscious bloom filters,” in Proceedings of the twenty-seventh ACM symposium on Principles of distributed computing, ser. PODC’08. New York,
Jo
NY, USA: ACM, 2008, pp. 355–364.
1045
[44] M. Ahmadi and S. Wong, “A memory-optimized bloom filter using an additional hashing function,” in IEEE Global Telecommunications Con-
ference (GLOBECOM’08), Nov 2008, pp. 1–5.
63
Journal Pre-proof
[45] A. Goel and P. Gupta, “Small subset queries and bloom filters using
1050
pro of
ternary associative memories, with applications,” SIGMETRICS Perform. Eval. Rev., vol. 38, no. 1, pp. 143–154, Jun. 2010.
[46] C. E. Rothenberg, C. A. B. Macapuna, F. L. Verdi, and M. F. Magalhaes, “The deletable bloom filter: a new member of the bloom family,” IEEE Communications Letters, vol. 14, no. 6, pp. 557–559, June 2010.
[47] R. P. Laufer, P. B. Velloso, and O. C. M. B. Duarte, “A generalized bloom 1055
filter to secure distributed network applications,” Comput. Netw., vol. 55,
re-
no. 8, pp. 1804–1819, Jun. 2011.
[48] J. L. Dautrich, Jr. and C. V. Ravishankar, “Inferential time-decaying bloom filters,” in Proceedings of the 16th International Conference on Extending Database Technology, ser. EDBT’13. 2013, pp. 239–250.
lP
1060
New York, USA: ACM,
[49] F. Concas, P. Xu, M. A. Hoque, J. Lu, and S. Tarkoma, “Multiple set matching and pre-filtering with bloom multifilters.” [50] M. Mitzenmacher, “A model for learned bloom filters and related struc-
1065
urn a
tures,” arXiv preprint arXiv:1802.00884, 2018. [51] A. Singh and S. Batra, “Streamed data analysis using adaptable bloom filter,” Computing and Informatics, vol. 37, no. 3, pp. 693–716, 2018.
[52] Y. Hua, B. Xiao, B. Veeravalli, and D. Feng, “Locality-sensitive bloom filter for approximate membership query,” IEEE Transactions on Com-
puters, vol. 61, no. 6, pp. 817–830, 2012.
[53] S. Negi, A. Dubey, A. Bagchi, M. Yadav, N. Yadav, and J. Raj, “Dynamic
Jo
1070
partition bloom filters: A bounded false positive solution for dynamic set membership,” arXiv preprint arXiv:1901.06493, 2019.
[54] N. Mousavi and M. Tripunitara, “Constructing cascade bloom filters for efficient access enforcement,” Computers & Security, vol. 81, pp. 1–14,
1075
2019. 64
Journal Pre-proof
[55] M. A. Bender, M. Farach-Colton, R. Johnson, R. Kraner, B. C. Kuszmaul,
pro of
D. Medjedovic, P. Montes, P. Shetty, R. P. Spillane, and E. Zadok, “Don’t thrash: How to cache your hash on flash,” Proc. VLDB Endow., vol. 5, no. 11, pp. 1627–1637, Jul. 2012. 1080
[56] D. E. Knuth, The Art of Computer Programming: Sorting and Searching. Addison-Wesley, 1998.
[57] M. Al-hisnawi and M. Ahmadi, “Deep packet inspection using quotient filter,” IEEE Communications Letters, vol. 20, no. 11, pp. 2217–2220, Nov
1085
re-
2016.
[58] S. Dutta, A. Narang, and S. K. Bera, “Streaming quotient filter: A near optimal approximate duplicate detection approach for data streams,”
lP
Proc. VLDB Endow., vol. 6, no. 8, pp. 589–600, Jun. 2013. [59] P. Goudarzi, H. T. Malazi, and M. Ahmadi, “Khorramshahr: A scalable peer to peer architecture for port warehouse management system,” Jour1090
nal of Network and Computer Applications, vol. 76, pp. 49 – 59, 2016. [60] S. Garg, A. Singh, K. Kaur, G. S. Aujla, S. Batra, N. Kumar, and M. Obai-
urn a
dat, “Edge computing-based security framework for big data analytics in vanets,” IEEE Network, vol. 33, no. 2, pp. 72–81, 2019.
[61] S. Garg, A. Singh, K. Kaur, S. Batra, N. Kumar, and M. S. Obaidat, 1095
“Edge-based content delivery for providing qoe in wireless networks using quotient filter,” in 2018 IEEE International Conference on Communications (ICC).
IEEE, 2018, pp. 1–6.
Jo
[62] R. Shubbar and M. Ahmadi, “Efficient name matching based on a fast two-dimensional filter in named data networking,” International Journal
1100
of Parallel, Emergent and Distributed Systems, vol. 34, no. 2, pp. 203–221,
2019.
65
Journal Pre-proof
[63] R. S. Boyer and J. S. Moore, MJRTY—A Fast Majority Vote Algorithm.
pro of
Dordrecht: Springer Netherlands, 1991, pp. 105–117, doi:10.1007/978-94011-3488-0 5. 1105
[64] G. S. Manku and R. Motwani, “Approximate frequency counts over data streams,” in Proceedings of the 28th international conference on Very Large Data Bases.
VLDB Endowment, 2002, pp. 346–357.
[65] A. Metwally, D. Agrawal, and A. El Abbadi, “Efficient computation of frequent and top-k elements in data streams,” in International Conference on Database Theory, ser. ICDT’05. Berlin, Heidelberg: Springer-Verlag,
re-
1110
2005, pp. 398–412, doi:10.1007/978-3-540-30570-5 27. [66] G. Cormode and S. Muthukrishnan, “An improved data stream summary: the count-min sketch and its applications,” Journal of Algorithms, vol. 55,
1115
lP
no. 1, pp. 58–75, 2005.
[67] M. Charikar, K. Chen, and M. Farach-Colton, “Finding frequent items in data streams,” Automata, languages and programming, pp. 784–784, 2002. Wu,
“Count
urn a
[68] S.
min
sketch
and
its
applications,”
http://grigory.us/files/cm-sketch.pdf, December 2014, [Accessed Online:
1120
April 2016].
[69] S.
Mewtoo.
(2010)
Count
https://sites.google.com/site/countminsketch/.
min [Accessed
sketch. on:
Dec
2016].
Jo
[70] P. P. Talukdar and W. W. Cohen, “Scaling graph-based semi supervised 1125
learning to large number of labels using count-min sketch.” in AISTATS, 2014, pp. 940–947, doi:https://arxiv.org/abs/1310.2959.
[71] X. D. Hoang and H. K. Pham, “A review on hot-ip finding methods and its application in early ddos target detection,” Future Internet, vol. 8, no. 4, p. 52, 2016. 66
Journal Pre-proof
1130
[72] G. Pitel, G. Fouquier, E. Marchand, and A. Mouhamadsultane, “Count-
pro of
min tree sketch: Approximate counting for nlp,” in 2nd International Symposium on Web Algorithms (ISWAG’2016), Deauville, France, vol. 1, 2016.
[73] N. Bonelli, C. Callegari, and G. Procissi, “A probabilistic counting frame1135
work for distributed measurements,” IEEE Access, vol. 7, pp. 22 644– 22 659, 2019.
[74] X. Zhu, G. Wu, H. Zhang, S. Wang, and B. Ma, “Dynamic count-min
re-
sketch for analytical queries over continuous data streams,” in 2018 IEEE 25th International Conference on High Performance Computing (HiPC). 1140
IEEE, 2018, pp. 225–234.
[75] J. S. Moore, “A fast majority vote algorithm,” Automated Reasoning:
lP
Essays in Honor of Woody Bledsoe,
1981. [Online]. Available:
ftp://www.cs.utexas.edu/pub/boyer/ics-reports/cmp32.pdf [76] S. Matusevych, A. J. Smola, and A. Ahmed, “Hokusai-sketching streams 1145
in real time,” in Proceedings of the Twenty-Eighth Conference on Uncer-
urn a
tainty in Artificial Intelligence, ser. UAI’12. Arlington, Virginia, United States: AUAI Press, 2012, pp. 594–603.
[77] M. Durand and P. Flajolet, “Loglog counting of large cardinalities,” in In ESA, 2003, pp. 605–617.
1150
[78] P. Flajolet, E. Fusy, and O. Gandouet, “Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm,” in Proceedings of The
Jo
International Conference On Analysis Of Algorithms (AOFA’07), 2007. [Online]. Available:
http://cscubs.cs.uni-bonn.de/2016/proceedings/
paper-03.pdf
1155
[79] E. Fusy and F. Giroire, “Estimating the number of active flows in a data stream over a sliding window,” in Proceedings of the Meeting on Analytic
67
Journal Pre-proof
Algorithmics and Combinatorics, ser. ANALCO’07.
Philadelphia, PA,
pro of
USA: Society for Industrial and Applied Mathematics, 2007, pp. 223–231. [80] P. Flajolet and G. N. Martin, “Probabilistic counting algorithms for data 1160
base applications,” Journal of Computer and System Sciences, vol. 31, no. 2, pp. 182 – 209, 1985. [81] T.
Karnezos,
“HLL
talk
at
https://research.neustar.biz/2014/09/23/hll-talk-at-sfpug/,
SFPUG,” Septem-
ber 2014, [Accessed Online: Jan 2017].
[82] W. Wu, J. F. Naughton, and H. Singh, “Sampling-based query re-
re-
1165
optimization,” in Proceedings of the International Conference on Management of Data, ser. SIGMOD’16.
lP
pp. 1721–1736.
New York, NY, USA: ACM, 2016,
[83] E. Georganas, A. Bulu¸c, J. Chapman, L. Oliker, D. Rokhsar, and 1170
K. Yelick, “Parallel de bruijn graph construction and traversal for de novo genome assembly,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Press,
urn a
2014, pp. 437–448.
[84] G. Drakopoulos, S. Kontopoulos, and C. Makris, “Eventually consistent cardinality estimation with applications in biodata mining,” in Proceedings of the 31st Annual ACM Symposium on Applied Computing. ACM, 2016, pp. 941–944.
[85] Y. Zhao, S. Guo, and Y. Yang, “Hermes: An optimization of hyperloglog counting in real-time data processing,” in International Joint Conference on Neural Networks (IJCNN). IEEE, 2016, pp. 1890–1895.
[86] S. Dietzel, A. Peter, and F. Kargl, “Secure cluster-based in-network information aggregation for vehicular networks,” in IEEE 81st Vehicular Technology Conference (VTC Spring). IEEE, 2015, pp. 1–5.
[87] G. Cormode, Streaming Methods in Data Analysis. Cham: Springer International Publishing, 2015, pp. 3–6.
[88] Z. Zhou and B. Hajek, “Per-flow cardinality estimation based on virtual loglog sketching,” in 2019 53rd Annual Conference on Information Sciences and Systems (CISS).
IEEE, 2019, pp. 1–6.
[89] D. N. Baker and B. Langmead, “Dashing: Fast and accurate genomic distances with hyperloglog,” BioRxiv, p. 501726, 2018.
[90] R. Morris, “Counting large numbers of events in small registers,” Commun. ACM, vol. 21, no. 10, pp. 840–842, 1978.
[91] M. Wegman, “Sample counting,” Private Communication, 1984.
[92] S. Heule, M. Nunkesser, and A. Hall, “Hyperloglog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm,” in Proceedings of the 16th International Conference on Extending Database Technology, ser. EDBT’13. New York, NY, USA: ACM, 2013, pp. 683–692, doi:10.1145/2452376.2452456.
[93] A. Rajaraman and J. D. Ullman, Mining of Massive Datasets. New York, NY, USA: Cambridge University Press, 2011. [Online]. Available: http://infolab.stanford.edu/~ullman/mmds/book.pdf
[94] A. Al-Fuqaha, “Similarity analysis and distance: min-hashing, locality sensitive hashing,” https://cs.wmich.edu/~alfuqaha/summer14/cs6530/lectures/SimilarityAnalysis.pdf, 2014, [Accessed Online: March 2017].
[95] G. Shakhnarovich, P. Indyk, and T. Darrell, “Locality sensitive hashing,” https://en.wikipedia.org/wiki/Locality-sensitive-hashing/, 2007, [Accessed Online: Dec 2016].
[96] A. Broder, “On the resemblance and containment of documents,” in Proceedings of the Compression and Complexity of Sequences, ser. SEQUENCES ’97. Washington, DC, USA: IEEE Computer Society, 1997, pp. 21–29.
[97] M. Datar and S. Muthukrishnan, Estimating Rarity and Similarity over Data Stream Windows. Berlin, Heidelberg: Springer Berlin Heidelberg, 2002, pp. 323–335, doi:10.1007/3-540-45749-6_31.
[98] O. Chum, M. Perd’och, and J. Matas, “Geometric min-hashing: Finding a thick needle in a haystack,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR’09). IEEE, 2009, pp. 17–24.
[99] S. Ioffe, “Improved consistent sampling, weighted minhash and l1 sketching,” in 10th International Conference on Data Mining (ICDM). IEEE, 2010, pp. 246–255, doi:10.1109/ICDM.2010.80.
[100] A. Z. Broder and C. G. Nelson, “Method for determining the resemblance of documents,” May 2001, US Patent 6,230,155.
[101] B. D. Ondov, T. J. Treangen, P. Melsted, A. B. Mallonee, N. H. Bergman, S. Koren, and A. M. Phillippy, “Mash: fast genome and metagenome distance estimation using minhash,” Genome Biology, vol. 17, no. 1, p. 132, 2016.
[102] S. Thaiyalnayaki and J. Sasikala, “Indexing near-duplicate images in web search using minhash algorithm,” in International Conference on Processing of Materials, Minerals and Energy (PMME). Elsevier, 2016, pp. 1–7.
[103] S.-J. Lee and J.-K. Min, “An efficient large graph clustering technique
based on min-hash,” Journal of KIISE, vol. 43, no. 3, pp. 380–388, 2016.
[104] J. Drew, T. Moore, and M. Hahsler, “Polymorphic malware detection using sequence classification methods,” in Security and Privacy Workshops
(SPW).
IEEE, 2016, pp. 81–87.
[105] S.-H. Lee, M.-U. Song, J.-K. Jung, and T.-M. Chung, “A study of malicious code classification system using minhash in network quarantine using SDN,” in International Conference on Computer Science and its Applications. Springer, 2016, pp. 594–599.
[106] B. Rao and E. Zhu, “Searching web data using MinHash LSH,” in Proceedings of the International Conference on Management of Data, ser. SIGMOD’16.
New York, NY, USA: ACM, 2016, pp. 2257–2258.
[107] A. Gionis, P. Indyk, and R. Motwani, “Similarity search in high dimensions via hashing,” in Proceedings of the 25th International Conference on Very Large Data Bases, ser. VLDB’99. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1999, pp. 518–529.
[108] J. Wang, H. T. Shen, J. Song, and J. Ji, “Hashing for similarity search: A survey,” CoRR, vol. abs/1408.2927, 2014.
[109] F. Chierichetti and R. Kumar, “LSH-preserving functions and their applications,” Journal of the ACM (JACM), vol. 62, no. 5, p. 33, 2015.
[110] A. Becker, L. Ducas, N. Gama, and T. Laarhoven, “New directions in nearest neighbor searching with applications to lattice sieving,” in Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 2016, pp. 10–24.
[111] Z. Kang, W. T. Ooi, and Q. Sun, “Hierarchical, non-uniform locality sensitive hashing and its application to video identification,” in IEEE International Conference on Multimedia and Expo (ICME’04), vol. 1. IEEE, 2004, pp. 743–746.
[112] C. Soh, H. B. K. Tan, Y. L. Arnatovich, and L. Wang, “Detecting clones in android applications through analyzing user interfaces,” in 2015 IEEE
23rd International Conference on Program Comprehension, May 2015, pp. 163–173.
[113] K. Berlin, S. Koren, C.-S. Chin, J. P. Drake, J. M. Landolin, and A. M. Phillippy, “Assembling large genomes with single-molecule sequencing and locality-sensitive hashing,” Nature biotechnology, vol. 33, no. 6, pp. 623–630, 2015.
[114] H. Naderi, P. Vinod, M. Conti, S. Parsa, and M. H. Alaeiyan, “Malware signature generation using locality sensitive hashing,” in International Conference on Security & Privacy. Springer, 2019, pp. 115–124.
[115] Y. Li, L. Hu, K. Xia, and J. Luo, “Fast distributed video deduplication via locality-sensitive hashing with similarity ranking,” EURASIP Journal on Image and Video Processing, vol. 2019, no. 1, p. 51, 2019.
[116] P. Indyk and R. Motwani, “Approximate nearest neighbors: Towards removing the curse of dimensionality,” in Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, ser. STOC’98. New York, USA: ACM, 1998, pp. 604–613.
[117] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, “Locality-sensitive hashing scheme based on p-stable distributions,” in Proceedings of the Twentieth Annual Symposium on Computational Geometry, ser. SCG’04. ACM, 2004, pp. 253–262.
[118] M. S. Charikar, “Similarity estimation techniques from rounding algorithms,” in Proceedings of the Thirty-fourth Annual ACM Symposium on Theory of Computing, ser. STOC’02. New York, USA: ACM, 2002, pp. 380–388.