
Probabilistic Data Structures for Big Data Analytics: A Comprehensive Review

Amritpal Singh a, Sahil Garg a, Ravneet Kaur a, Shalini Batra a, Neeraj Kumar a,*, Albert Y. Zomaya b

a Computer Science & Engineering Department, Thapar University, Patiala (Punjab), India.
b Centre for Distributed and High Performance Computing, The University of Sydney, Sydney, Australia.


Abstract

An exponential increase in data generation resources has been widely observed over the last decade, owing to the evolution of technologies such as cloud computing, IoT, and social networking. This enormous and seemingly unlimited growth of data has led to a paradigm shift in storage and retrieval patterns, from traditional data structures to Probabilistic Data Structures (PDS). PDS are a group of data structures that are extremely useful for Big data and streaming applications because they avoid high-latency analytical processes. These data structures use hash functions to compactly represent a set of items in stream-based computing while providing approximations with error bounds, so that well-formed approximations are built into the data collections directly. Compared to traditional data structures, PDS use much less memory and answer complex queries in constant time. This paper provides a detailed discussion of the issues normally encountered in massive data sets, such as storage, retrieval, and querying. Further, the role of PDS in solving these issues is discussed, where these data structures are used as temporary accumulators in query processing. Several variants of existing PDS, along with their application areas, are also explored, giving a holistic view of the domains where these data structures can be applied for efficient storage and retrieval of massive data sets. Mathematical proofs of the parameters considered in the PDS are also presented, and a relative comparison of the various PDS with respect to these parameters is provided.

* Corresponding author. Email addresses: [email protected] (Amritpal Singh), [email protected] (Sahil Garg), [email protected] (Ravneet Kaur), [email protected] (Shalini Batra), [email protected] (Neeraj Kumar), [email protected] (Albert Y. Zomaya)

Keywords: Big data, Internet of Things (IoT), Probabilistic Data Structures, Bloom filter, Quotient Filter, Count Min Sketch, HyperLogLog Counter, Min-Hash, Locality sensitive hashing


1. Introduction

Over the last few years, there has been an exponential increase in data. The amount of data produced every day by different sources, such as IoT sensors and social networks like Twitter, Instagram, and WhatsApp, has grown from terabytes to petabytes. This voluminous data growth, coupled with the need for efficient storage and retrieval, poses a big challenge for industry as well as academia [1]. To handle this large volume of data, traditional algorithms cannot go beyond linear processing. Moreover, traditional approaches demand that the entire data set be stored in a formatted manner. These massive datasets require architectures and tools for data storage, processing, mining, handling and leveraging of the information to offer better services.

In the age of in-stream data [2] and the Internet of Things (IoT) [3], there is no limit on the amount of data coming from varied sources. Moreover, the complexity of the data and the amount of noise associated with it are not predefined. Since the size of the data is unknown, one cannot determine how much memory is required for storing it. Moreover, the amount of data to be analysed is in exabytes, which is too large to fit in the memory available for linear processing, and even the storage of such data is challenging. Thus, it is difficult to capture, store and process the incoming data within the stipulated time [4]. Data sets with such characteristics are typically referred to as Big data. Various definitions have been used to define Big data from different perspectives. Machine learning is used in a number of applications for optimization [5]. Further, the trend of traditional data mining is shifting towards a more complex task, i.e., correlated utility-based pattern mining [6]. In this paper we try to define Big data's most relevant characteristics from the data analytics view, referred to as the 9 V's model. An illustrative description of these V's is depicted in Fig. 1.

Figure 1: Overview of Big data

Big data technologies are important in providing accurate analysis, leading to more concrete decision-making, which in turn results in greater operational efficiency, cost reductions, and reduced risk for the business. To cope with Big data efficiently, new technologies appeared that enable distributed data storage and parallel data processing. Technologies from different vendors include MapReduce by Google, which provides a new method of analyzing data that can be scaled up from single servers to thousands of high- and low-end machines; NoSQL Big data systems, which are designed to take advantage of new cloud computing architectures and allow massive computations to be run inexpensively and efficiently; and Amazon and Azure cloud platforms, which provide various tools to handle Big data. Along with the above mentioned technologies, Apache Hadoop (with its HDFS and MapReduce components) was a pioneering technology. Hadoop, developed by Apache, is an open source tool, and the most commonly used "Hadoop MapReduce" is based on Google's MapReduce combined with Hadoop. Hadoop is a package of many components, which come in various forms, including Apache Hive, an infrastructure for data warehousing; Apache Oozie, for scheduling Hadoop jobs; Apache Pig, a data flow platform responsible for the execution of MapReduce jobs; and Apache Spark, an open source framework used for cluster computing. Although Hadoop provides an overall package for Big data analytics that requires little technical background to operate, there are still some issues in Hadoop which need optimized solutions. In Hadoop, MapReduce processes large data sets with a parallel and distributed algorithm. Data is distributed and processed over the cluster in MapReduce, leading to an increase in processing time and a decrease in processing speed. Further, Hadoop supports batch processing only; it does not process streamed data, and hence its overall performance is slower (Apache Spark supports stream processing). One of the major issues in Hadoop is that its programming model is quite restrictive, which prevents easy modification of the inbuilt algorithms. The efficient analysis of in-stream data often requires powerful tools such as Apache Spark, Google BigQuery, High-Performance Computing Cluster (HPCC), etc. However, these tools are not suitable for real-time use cases where a fast response is required, such as processing data in a specific application domain or implementing interactive jobs and models. Recent research directions in the area of Big data processing, analysis and visualization clearly indicate the importance of Probabilistic Data Structures (PDS).

The use of deterministic data structures to perform analysis of in-stream data often involves a great deal of computational, space and time complexity. Probabilistic alternatives (PDS) to deterministic data structures are better in terms of simplicity and the constant factors involved in actual run-time. They are suitable for large data processing, approximate predictions, fast retrieval and storing unstructured data, and thus play an important role in Big data processing.

PDS are, tautologically speaking, data structures having a probabilistic component [7]. This probabilistic component is used to trade accuracy for time or space. PDS cannot give a definite answer; instead they provide a reasonable approximation of the answer and a way to estimate the error in this approximation. They are useful for Big data and streaming applications because they can decrease the amount of memory needed (in comparison to data structures that give exact answers) [8]. Different variants of PDS are highlighted in Fig. 2. In the majority of cases, these data structures use hash functions to randomize the items. Because they ignore collisions, their size stays constant, but this is also the reason why they cannot give exact values. Moreover, PDS offer several advantages, which are given below:

• They use a small amount of memory (one can control how much).

• They are easily parallelizable (hashes are independent).

• They have constant query time.

The major focus of this paper is on the role of Probabilistic Data Structures (PDS) in the following scenarios:

• Approximate Membership Query: Store bulk data in a small space S and respond to a user's membership query efficiently in that space.


Figure 2: Overview of PDS

• Frequency Count: Count the number of times a data item has arrived in the massive data set.

• Cardinality Estimate: Find the cardinality, i.e., the number of distinct members of a set, in the massive data set.

• Similarity Search: Identify similar items, i.e., find the approximate nearest neighbors (most similar items) to the query in the available dataset.

Organization of the paper: Section II provides a detailed discussion of the approximate membership query using the most frequently used PDS, the Bloom Filter (BF), and its variant the Quotient Filter (QF). Section III discusses how the frequency count problem is solved efficiently by the PDS named Count Min Sketch (CMS). Section IV provides an insight into cardinality estimation using the Hyper Log Log (HLL) counter, along with a relative comparison and review of various variants of HLL. Section V discusses the PDS used for similarity search over massive Big data and provides a detailed discussion of Min-Hash and the family of Locality Sensitive Hashing (LSH) (various families of LSH, based on the distance metric used, are discussed in Appendix 1). Section VI summarizes the role of all the above mentioned PDS with respect to various parameters. Finally, Section VII concludes the paper.

2. Approximate Membership Query


Given millions or even billions of data elements, developing efficient solutions for storing, updating, and querying them becomes difficult, especially when the data is queried by some real-time application. Traditional database approaches, which perform filtering and analysis after storing the data, are not efficient for real-time processing. Since the data is bulky and requires large data structures, the retrieval cost of even a small query is very high. These issues clearly indicate that efficient storage and searching techniques are required for processing Big data. In applications where efficiency is more important than accuracy, probabilistic approaches and approximation algorithms can serve as a key ingredient in data processing. Thus, the membership query problem is converted to the approximate membership query problem, where probabilistic results are acceptable.

Definition 1: (Membership Query) For a given set S = {x_1, x_2, ..., x_N} with N items, a membership query confirms the presence of a queried element x_q in the set using deterministic approaches. The result of a membership query is binary, i.e., 1 indicates x_q ∈ S and 0 indicates x_q ∉ S. The space and computational costs of these approaches depend on the size of the dataset considered.

Definition 2: (Approximate Membership Query (AMQ)) For a given set S = {x_1, x_2, ..., x_N} with N items, an AMQ checks the presence of a queried element x_q in the set by using an approximation or probabilistic approach for fast results. The query returns results with some approximation: the answer for x_q is either "possibly in set" or "definitely not in set". The query complexity is independent of the size of the dataset and the space complexity is significantly reduced.

2.1. Bloom Filters

Storing bulk data in a small space S and querying items within that space can be accomplished by using the BF, a randomized data structure that supports set membership queries.

Bloom Filter (BF) [9], a space efficient probabilistic data structure, is used to represent a set S ⊂ U of n elements and supports approximate membership testing. It consists of an array of m bits, denoted by BF[1, 2, ..., m], with all bits initially set to 0. The filter uses k independent hash functions h_j, 1 ≤ j ≤ k, each of which maps an element of U uniformly to one of the m array positions. To insert an element x ∈ S, the bits at positions h_j(x) are set to 1 for all j, 1 ≤ j ≤ k.

Figure 3: Insertion in Bloom filter

Given an item y_i ∈ Q, where Q is the set of query elements, its membership is checked by examining the bits at the positions given by the hash functions in the BF[1...m] array. If all hash positions h_j(y_i), 1 ≤ j ≤ k, are set to 1, then y_i is considered to be part of S; otherwise it is not. This space efficient representation comes at the cost of false positives, i.e., elements can be erroneously reported as members of the set although they are not. In practice, the huge space savings often outweigh the false positives if they are kept at a sufficiently low rate. Given a BF with m bits and k hash functions, the insertion and membership query time complexity is always O(k). Their detailed working is illustrated in Figs. 3 and 4.

Figure 4: Querying in Bloom filter
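To make the insertion and query mechanics concrete, the following is a minimal Python sketch of a Bloom filter. It is not taken from the paper; the double-hashing scheme used to derive the k positions and the parameter names (m, k) are illustrative assumptions.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: m-bit array, k hash positions per item."""
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)          # one byte per bit, for simplicity

    def _positions(self, item):
        # Derive k positions from two digests (double-hashing style);
        # any k independent hash functions would do.
        data = str(item).encode()
        h1 = int.from_bytes(hashlib.md5(data).digest(), "big")
        h2 = int.from_bytes(hashlib.sha1(data).digest(), "big")
        return [(h1 + j * h2) % self.m for j in range(self.k)]

    def insert(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1            # set all k bits; O(k) time

    def query(self, item):
        # True means "possibly in set", False means "definitely not in set".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(m=1024, k=5)
bf.insert("alice")
print(bf.query("alice"), bf.query("bob"))   # True (always), False (almost surely)
```

Note that deletion is not supported in this form: clearing any of the k bits could also unmark other elements that hash to the same positions.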


To derive the false positive probability of a BF, it is assumed that the hash functions used are universal random functions, i.e., every BF bit is equally likely to be selected [9, 10].

When inserting an element into the filter, the probability that a certain bit is not set to one by a single hash function is 1 − 1/m. Since k hash functions are used, the probability that none of them sets the specific bit to one is (1 − 1/m)^k. After performing n insertions in the BF, the probability that a given bit is still zero is:

(1 − 1/m)^{kn}    (1)

Consequently, the probability that the bit is one is:

1 − (1 − 1/m)^{kn}    (2)

In the querying process, if all the hash positions h_j(x_q), 1 ≤ j ≤ k, corresponding to the hash functions are set to 1, the BF reports that the queried element belongs to the set.

The probability of a false positive, i.e., the probability that an element is not part of the set but the BF claims that it is, is therefore given by:

(1 − (1 − 1/m)^{kn})^k    (3)

Using the standard approximation (1 − 1/m)^m ≈ e^{−1}, it can be concluded that:

(1 − (1 − 1/m)^{kn})^k ≈ (1 − e^{−kn/m})^k    (4)

Thus the false positive rate for a BF is given by:

f_p = (1 − e^{−kn/m})^k    (5)

Example: To illustrate the above proof of the false positive rate of a BF, assume a BF of 10^9 bits needs to accommodate 10^8 elements using 5 hash functions. The false positive rate for this scenario is obtained as follows:

1. While inserting an element, the probability that a particular bit is set to 1 by one hash function is 1/10^9.

2. Hence the probability that the bit is not set by that hash function is 1 − 1/10^9 = 0.999999999.

3. Using 5 hash functions, the probability that the bit is not set by any of them is (1 − 1/10^9)^5 ≈ 0.999999995.

4. After inserting 10^8 elements in the BF, the probability that a given bit is still zero is (1 − 1/10^9)^{5×10^8} ≈ 0.606530660, so the probability that it is set to one is 1 − (1 − 1/10^9)^{5×10^8} ≈ 0.393469332.

5. A query examines all 5 hash positions, so the probability of a false positive is (1 − (1 − 1/10^9)^{5×10^8})^5 ≈ 0.009430928.

The false positive probability f_p decreases as the size of the BF (m) increases and increases as the number of inserted elements grows. Up to a point, using more hash functions also decreases the probability of a false positive, and the user can predefine the acceptable false positive rate according to the application's requirements. The accuracy of a BF thus depends on the filter size m, the number of hash functions k, and the number of elements n. For given m and n, the value of k that minimizes f_p is:

k_opt = (m/n) ln 2    (6)

The space advantage of a BF depends on the error rate acceptable for the application considered. To maintain a fixed false positive probability p with n elements, the required size m of a BF is given by:

m = −(n × ln p) / (ln 2)^2    (7)
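As a quick numerical check of Eqs. (5)-(7), the short script below recomputes the worked example and the corresponding optimal parameters. It is an illustrative sketch; the chosen numbers (10^9 bits, 10^8 elements, 5 hash functions, 1% target error) simply mirror the example above.

```python
import math

m, n, k = 10**9, 10**8, 5

# Eq. (5): false positive rate for given m, n, k
fp = (1 - math.exp(-k * n / m)) ** k
print(f"fp for k={k}: {fp:.9f}")            # ~0.009430928, matching the example

# Eq. (6): number of hash functions that minimizes fp
k_opt = (m / n) * math.log(2)
print(f"k_opt: {k_opt:.2f}")                # ~6.93, so about 7 hash functions

# Eq. (7): bits needed to hold n elements at a target error rate p
p = 0.01
m_required = -n * math.log(p) / (math.log(2) ** 2)
print(f"bits for p={p}: {m_required:.0f}")  # ~9.59e8 total, i.e. ~9.6 bits per element
```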

2.1.1. Categories of Bloom Filter (BF)


Based upon the application domains, many variants of BF have been proposed; these can be broadly classified into four categories.

lP

This type of BF has constant false positive rate and works efficiently on static data sets. Based on the number of elements(n), parameters like size 175

of BF (m) and number of hash functions used(k) can be decided. Moreover, another essential determinant is the potential range of the elements

urn a

to be inserted; if it is limited, a deterministic bit vector can do better. Distance-sensitive BF [11], weighted BF [12], etc., are some of the BFs which falls under this category. Some of the problems associated with

180

SBF are that collision rate increases exponentially as size of the incoming data increases and deletion operation is not allowed because a particular bit which is set to one earlier will be set to zero after deletion operation but this will unmark another elements too which had marked that bit as

Jo

one. Standard BFs are suitable for representing static Big data where size

185

is known in advance, i.e., it does not varies with time.

• Counting Bloom Filter (CBF): CBF introduced by Fan et al. [13] uses a counter instead of a bit array. Here, each position in array is a counter, allowing insertion and deletion operations on the CBF. Whenever an element is added or deleted from the CBF, the corresponding counters are 11

Journal Pre-proof

190

incremented or decremented respectively, eg. d-left CBF [14]. Although it

pro of

can be efficiently used for applications where deletion operation is required, it increases the memory overhead by a larger factor and determining value of counter is quite cumbersome process.

• Incremental Bloom Filter (IBF): The BFs which are adaptive in nature, 195

i.e., change their size according to the incoming data fall in this category. The basic idea of incremental BFs is to represent a dynamic set D with a dynamic bit matrix and accommodate incoming data by adding new filters at runtime. If rough estimate of the number of elements to be inserted

200

re-

is not available, then a hash table or an incremental BFs is the better option. Dynamic BFs [15], scalable BFs [16], etc., belong to this category. However, major drawback of these types of BFs is that query complexity increases as the size increases. Initial size of filter is important factor

lP

in such cases; assigning small initial size array leads to computational overhead, slice addition and query complexity overhead; on the other hand 205

using larger size for initial dynamic BF size may lead to memory wastage. • Ageing Bloom Filter (ABF): Some network applications require high-speed

urn a

processing of packets. For this purpose, BFs array should reside in a fast and small memory. In such cases, due to the limited memory size, stale data in the BF needs to be deleted to make space for new data. The answer

210

to such type of applications is Ageing BFs. These BFs work similar to Least Recently Used (LRU) cache. Stable BF[17], A2 buffering, double

buffering [18], etc., are some of the examples of ABF. In A2 buffering,

concept of buffering is used but with two filters. Initially data is filled

Jo

in first filter and once the threshold exceeded, data is filled in next filter

215

but as soon as the threshold of second BF is crossed, data is evicted from first filter and this process continues. Advantage of this approach is that we can store data for more time by using double memory than simple buffering approach. Major flaw in such filters is that size of filter used is static so approximate estimate of incoming data is required in advance 12

Journal Pre-proof

220

to decide the size to filter. Moreover, sometimes these filters show false

pro of

negative results. 2.1.2. Applications

Initially BF was used to represent words in a dictionary. Gradually, BF was widely used in many networking and security algorithms like authentica225

tion, IP trace-backing, string matching, reply protection [10], etc. Presently it is used in fields as diverse as accounting, monitoring, load balancing, policy enforcement, routing, clustering, security of network [19]. Cellular networks use

re-

device-to-device communication using BF based approach to identify mobile applications [20]. BF is applied in VANET applications and cloud platforms 230

for DDoS attack prevention [21] and privacy preservation [22]. Recently BF has been implemented for Controller Area Network (CAN) for efficient Intrusion Detection which prevents various replay and modification attacks [23]. A

lP

solution for Hot spot tracking problem for streaming data has been proposed using Time-decaying Bloom Filters (TBF) and online sampling technology [24]. 235

Bloom filters are also being used in the field of bioinformatics to classify DNA sequences. A solution based on Multiple Bloom Filters (MBFs) is implemented

urn a

for Pattern Matching in DNA Sequencing data which optimizes the search for location and frequency of a specified pattern [25]. Industrial applications: Technical giants are also using BF and its variants in 240

different fields. Quora uses a shared BF in the feed back-end to filter out stories that people have seen before. In Facebook, type-ahead search, fetching friends and friends of friends to a user typed query is performed using BF. It uses a BF to avoid redirecting users to malicious websites. Oracle uses BFs to perform

Jo

Bloom pruning of partitions for certain queries. Apache HBase uses BF to

245

boost read speed by filtering out unnecessary disk reads of HFile blocks which do not contain a particular row or column [26]. Few concerns related to BF are: it is not possible to retrieve original key after hashing in BF and there is a probability of false positives and false negatives in some variants which support

13

Journal Pre-proof

250

pro of

deletion. Many variants Fuzzy-folded Bloom Filter [27], ID Bloom Filter [28], UltraFast Bloom Filter [29], r-Dimensional Bloom Filter [30], Magic Cube Bloom Filter (MCBF) [31] etc. have been designed and discussed in literature based on various applications and their requirements. Table 1 provides a detailed literature review of variants of BF along with their application domains. These 255

variants allow the end users to choose one of the variants of BFs based upon

Jo

urn a

lP

re-

their usage in different applications.

14

Chang et al. [39]

[36]

10.

Shanmugasundaram et al.

7.

Zhong et al. [38]

Goh E. J. [35]

6.

9.

Bloom Filter

Kumar et al. [34]

5.

Chazelle et al. [37]

Cohen and Matias [33]

4.

8.

Hierarchical

Michael Mitzenmacher [32]

3.

SBF

Bloom

Bloom

Filter Banks

Filter

Split

Bloomeir Filter

Filter

Secure

Bloom Filter

Space-coded

filter

Spectral Bloom

Bloom filter

Compressed

Bloom Filter

Counting

SBF

IBF

SBF

SBF

SBF

SBF

CBF

SBF

CBF

re-

• Membership query operation in

• Matrix of s × n is used, where s is

Continued on next page

works

mines which element of set S matches with element X.

• Routing and forwarding in net-

• Mapping of elements and sets deter-

number of BFs used.

mated cardinality of data set and n is

dynamically increasing data set.

stored in BF

• Number of BF are used in pipeline

predefined constant based on the esti-

• Membership of function values

• Sub string matching

• Privacy preserving applications

• Blind streaming

pro of

• Anomaly detection

• Measure per-flow traffic

• Supports frequency query

• Storing multisets

• Encode functions in BF

• Low false positive rate

• Hierarchical construction of BFs

random function twice to each element

• Secures indexes by applying pseduo

frequency query

• Represents a multi-set and support

ment

obtain the minimum count of an ele-

value corresponding to hash values to

• Filter increases the smallest counter

routing table over network

• P2P networks and distributing

• Web cache

• Compression of BF for transmission purpose

• Supports frequency query

• Used in conjunction with web

• Frequency query for an element can caches

naries

to represent sets

of simple bits

• Search in databases and dictio-

• Uses compact and probabilistic way

be answered by using counters instead

Areas of applications

Special Features

lP

Category

urn a

Fan et al. [13]

2.

Bloom Filter

Burton H. Bloom [9]

Filter Name

1.

Jo

S.No. Authors

Table 1: Variants of Bloom Filter

Journal Pre-proof

Almeida et al. [16]

18.

[11]

Bruck et al. [42]

Kirsch and Mitzenmacher

15.

17.

Donnet et al. [41]

14.

Bruck et al. [12]

sensitive

Deng et al. [17]

13.

16.

Distance-

Bonomi et al. [14]

12.

Sig-

CBF

ABF

Bloom

Fil-

Filter

Scalable Bloom

Bloom Filter

Adaptive

Bloom Filter

Weighted

ter

Bloom

Bloom Filter

Retouched

filter

Stable

Bloom Filter

d- left counting

IBF

CBF

SBF

SBF

SBF

ABF number of zeros remain same in the

• Ensures that, with time, expected

• Less number of collisions are reported

Bloom filter in sub-tables.

• Used with counters by dividing

topology

applications

where information about large sets

network

• Used efficiently in distributed

data

• Suitable for dynamic or stream-

• Changes the size of BF according to

positive rate

per bound to maintain constant false

Continued on next page

ing Big data membership query

apriori

input requirement under a tighter up-

per bound of data is not known

• It uses Adaptive counters

• Suited for applications where up-

• Uses varying number of hash functions

• Zipf distributed data sets

which are queried more frequently

• Used in binary classification

database applications

ment improvement in network and

• Used for speed and space require-

route tracing monitors.

of nodes must be shared among

pro of

• More bits are allocated to elements

implemented using LSH

• Identifies closeness of an item in S,

• Less false negatives reported

• Random Bit clearing process

re-

ously seen in streaming set S.

• Helps to identify whether X is previ-

BF.

• Duplicate detection in streaming

•Fingerprint matching

• Elements lookups

• Network Flow management

• Works as Double Buffer. • Based on the flow of data

Areas of applications

Special Features

lP

Category

urn a

natures

Length

Lu et al. [40]

Variable

Filter Name

11.

Jo

S.No. Authors

Table 1 – Continued from previous page

Journal Pre-proof

Goel and Gupta [45]

Rothenberg et al. [46]

23.

24.

[18]

Guo et al. [15]

Bloom Filter

Kirsch and Mitzenmacher

21.

22.

Less

Ahmadi and Wong [44]

20.

conscious

SBF

SBF

Bloom Filter

Deletable

Filter

Layered Bloom

Bloom Filter

Dynamic

hash

Bloom Filter

Optimized

Memory-

CBF

SBF

IBF

SBF

re-

by adding same size filter as the origi-

the amount of incoming data increases

load balancer, firewalls, etc.

• Supports deletion without false neg-

Continued on next page

• Used in middlebox services like

ative

ing loops

gion with high collisions

• Used in source routing for avoid-

large Big data

• Supports frequency query for

tributed environment

ment removal by keeping record of re-

• Use probabilistic approach for ele-

layers, starting from deepest one.

element is added by querying multiple

• Keeps track of number of time the

ers.

• A layered BF with multiple BF lay-

nal one

• Used in dynamic sets in dis-

• Streaming data monitoring

work

• Cryptographic process in net-

tional cost is a critical factor.

• In applications where computa-

search.

classifier based on the tuple space

• Used in multidimensional packet

pro of

• Dynamically increases size of BF as

hashing

• Uses double hashing and partition

random hashing

• Uses single hash function through

tuning

• Popularity awareness with off line

for rarely queried ones

ing skewed distribution

• Show good results for data hav-

• Uses more hash functions for important elements and less hash functions

Areas of applications

Special Features

lP

Category

urn a

Bloom Filter

ity

Zhong et al. [43]

Data Popular-

Filter Name

19.

Jo

S.No. Authors

Table 1 – Continued from previous page

Journal Pre-proof

Francesco Concas et al.

27.

Francesco Concas et al.

30.

31.

A. Singh and S. Batra [51]

29.

Y. Hua and X. Liu [52]

[49]

Michael Mitzenmacher [50]

28.

[49]

Dautrich et al. [48]

26.

Bloom Filter

Generalized

Filter Name

ABF

SBF

• Document Search. • Useful when the query stream

only for encoding multiple sets

• BF is enhanced by making use of pre-

which makes use of bit-wise vectors and locality-sensitive hash functions.

• Extension of standard Bloom filter

tainty. Continued on next page

Big data with noise and uncer-

imate membership testing on the

• It can efficiently handle approx-

• Document Search

head.

• Web caching

nificantly in reducing memory over-

• Load Balancing

variable length which contributes sig-

• Bloom vector uses multiple BF of

iteration.

predict the size of Bloom filter for next

utilization in streaming data

pro of

array in which Kalman filter is used to

Bloom filter by making use of learning

• It performs better than Scalable

sent.

• Peak hour analysis and server

fixed distribution.

chine learning so as to model the data

sets that the Bloom filter has to repre-

can be modelled as coming from a

filter which is based on applying ma-

re-

• Web caching

• Load Balancing

less sensors and data stream.

tion and hint base routing in wire-

• Used for duplicate element detec-

utilizes equal-sized Bloom filters not

• Extension of standard Bloom Filter

Bloom Filter

SBF

SBF

IBF

SBF

SBF

• Works on time window protocol.

Sensitive

Locality-

Bloom Vector

Bloom Filter

Adaptable

Filter

Learned Bloom

Bloom Matrix

Bloom Filter

Decaying

is mandatory to make more space

ing them. in the filter.

applications where eviction of data

• Used to reduce error in streaming

• Two set of hash functions are used: one for setting bits, another for reset-

Areas of applications

Special Features

lP

Category

urn a

Laufer et al. [47]

25.

Jo

S.No. Authors

Table 1 – Continued from previous page

Journal Pre-proof

Jo

Mousavi

Singh et al. [27]

Liu et al. [28]

Lu et al. [29]

Patgiri et al. [30]

35.

36.

37.

M.Tripunitara [54]

N.

and

Bloom Filter

r-Dimensional

Bloom Filter

Ultra-Fast

ter

ID Bloom Fil-

Bloom Filter

Fuzzy-folded

IBF

SBF

SBF

ABF

lP re-

in every aspect

• Performs better than Cuckoo Filter

ability

• Exhibits high adaptability and scal-

positive rate

• It is quite fast and shows less false

terms of membership query speed

• It outperforms other variants in

Multiple Data (SIMD) techniques.

• Used for real-world network ap-

of Things environment (IIoT)

Continued on next page

ship queries

• Suitable for large scale member-

for efficient network link speed

• Applications in routers/switches

plications

pro of

• It makes use of Single Instruction

spective positions.

• It directly records the set ID at re-

positions in a filter

• Maps each element to k particular

another

commodate hashed data of one BF into

• Fuzzy operations are utilized to ac-

approach

• Incorporates fuzzy-enabled folding

BF.

• Finds usage in Industrial Internet

to be minimum

is used for differentiating between the sets that were a failure by the previous

cation where acceptable error need

• Access enforcement in authenti-

• A cascade of multiple Bloom fil-

filter

Cascade Bloom ters is employed wherein the next BF

tured P2P networks

dard BF as leaves.

IBF

• Global collaboration in unstruc-

sists of a tree structure having stan-

ter

• Informed routing in P2P Net-

• Dynamic structure based on the conworking

Areas of applications

Special Features

cept of Bloom partition tree which con-

IBF

Category

tion Bloom fil-

Dynamic Parti-

Filter Name

urn a

Sidharth Negi et al. [53]

34.

33.

32.

S.No. Authors

Table 1 – Continued from previous page

Journal Pre-proof

38.

Bloom

SBF

Category

urn a

(MCBF)

Cube Filter

Magic

Filter Name

use of spatial locality

• Improves the query speed by making

redistributing the items

• It leads to improve in accuracy by

ship queries for Multiple Sets

• Used for carrying out member-

• Items belonging to the same set are stored in different Bloom filters

Areas of applications

Special Features

lP re-

pro of

SBF: Standard Bloom Filter, CBF: Counting Bloom Filter, IBF: Incremental Bloom Filter, and ABF: Ageing Bloom Filter

Sun et al. [31]

Jo

S.No. Authors

Table 1 – Continued from previous page

Journal Pre-proof

Journal Pre-proof

2.2. Quotient Filter

pro of

BF and its variants work efficiently only when the entire BF array resides in main memory. If the size of the BF exceeds the available RAM of the system, the number of operations required to fetch and check different parts of the array from secondary memory increases the query time manifold; if such a process continues, the BF loses the purpose of its use. Another PDS, named the Quotient Filter (QF) [55], can be used efficiently for approximate membership queries; it supports a multi-layer design, buffering and hash localization, which results in fast and efficient querying of elements even in secondary memory.

QF is a space efficient and cache friendly data structure which uses the quotienting technique of hashing to store a set S efficiently [56]. It supports insertion, deletion, querying and merging operations. Its detailed working is shown in Fig. 5.

lP

Each element x ∈ S is mapped by a primary hash function h to a p-bit value called the fingerprint of x, i.e., h(x) ∈ {0, ..., 2^p − 1} ⇒ fp(x). The fingerprints are stored in an open hash table of m = 2^q buckets having (r + 3) bits per bucket, where r is the number of least significant bits of fp(x) stored in a bucket, q = (p − r) is the number of most significant bits used by the quotienting technique, and the three extra bits per bucket are metadata bits. To insert the fingerprint of an element fp(x) into the QF, the remainder f_r ← (fp(x) mod 2^r) and the quotient f_q ← ⌊fp(x)/2^r⌋ are computed, where f_q denotes the index of the bucket to be used for insertion and f_r denotes the value to be inserted in bucket f_q. The main advantage of the QF over the BF is that fp(x) can be reconstructed from f_q and f_r, where fp(x) is given by:

fp(x) = f_q · 2^r + f_r    (8)
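The quotienting split and the reconstruction in Eq. (8) can be illustrated with a small Python sketch. The fingerprint function, the choice of p, q and r, and the omission of runs, clusters and metadata bits are assumptions made purely for illustration.

```python
import hashlib

P, R = 16, 6                 # p-bit fingerprint, r remainder bits
Q = P - R                    # q quotient bits -> 2**Q buckets

def fingerprint(item):
    """p-bit fingerprint of an item (illustrative primary hash h)."""
    digest = hashlib.sha256(str(item).encode()).digest()
    return int.from_bytes(digest, "big") % (1 << P)

def split(fp):
    """Quotienting: split a fingerprint into (f_q, f_r)."""
    f_r = fp % (1 << R)      # r least significant bits
    f_q = fp >> R            # q most significant bits = bucket index
    return f_q, f_r

def reconstruct(f_q, f_r):
    """Eq. (8): fp(x) = f_q * 2**r + f_r."""
    return (f_q << R) + f_r

fp = fingerprint("alice")
f_q, f_r = split(fp)
assert reconstruct(f_q, f_r) == fp   # the fingerprint is fully recoverable
print(f"fingerprint={fp}, bucket={f_q}, remainder={f_r}")
```

A full quotient filter additionally maintains the three metadata bits per bucket and shifts remainders within a cluster to resolve soft collisions, as described next.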

Definition 3: In a quotient filter, for two given fingerprints f p(x) and f p(y), it is stated that if fq (x) < fq (y) then fr (x) is always stored before fr (y). The quotienting technique tries to generate a unique remainder and quotient

for every xi ∈ S although there are some chances of collision. On the basis of


Figure 5: Quotient Filter [55]

remainder and quotient of fp(x), collisions are divided into two types:

Collisions in QF = Soft, if f_q(x) = f_q(y); Hard, if f_q(x) = f_q(y) and f_r(x) = f_r(y).    (9)

In case of soft collision (when fq of two items collide, but they have distinct

urn a

fr ), linear probing is used as a collision resolution strategy, where remainders of different fingerprints having same fq are stored contiguously; called run in QF. If necessary, remainders associated with different fq are shifted and corresponding metadata bits are updated for each bucket. In QF, cluster refers to the sequence of one or more consecutive runs without having any empty bucket in between them. A cluster is immediately followed by a empty slot. Canonical slot for a fingerprint x ( f p(x)) is the original bucket for the insertion of fr (x) indicated

Jo

by fq (x). The terms run and cluster are used to identify suitable position to insert or query the element after various shifts have been performed with the use of metadata bits. The false positive in QF are encountered due to hard collisions(when fq and fr of two items collide). Assuming that h(x) is

22

Journal Pre-proof

Table 2: Significance of bits in QF is

continua-

bit

tion bit

is bit

shifted

Significance

0

0

0

Empty Bucket

0

0

1

Bucket is holding start of run that has been shifted from

0

1

0

φ (Not used)

0

1

1

Bucket is holding continuation of run that has been

1

0

0

Bucket is holding start of run that is in same bucket.

1

0

1

Bucket Bi is holding start of run that has been shifted

pro of

is occupied

its quotient (fq ) bucket (canonical slot).

shifted from its quotient (fq ) bucket (canonical slot).

from its quotient (fq ) bucket (canonical slot). Bi is also occupied with some fr but its remainder is shifted right. 1

0

1

1

1

φ (Not used)

re-

1

Bucket Bi is holding continuation of run that has been

shifted from its quotient (fq ) bucket (canonical slot). Bi is also a slot in same run but its remainder is shifted right.

lP

distributed uniformly, the probability of a hard collision (Pr_HC) is given by:

Pr_HC = 1 − (1 − 1/2^p)^n ≈ 1 − e^{−n/2^p}    (10)

Metadata bits are used to find optimal location of elements which have been shifted from canonical slot because of soft collision; fr of element belongs to run

urn a

of a slot fq which is stored at different location. Most significant bit, referred as is occupied bit is set HIGH for ith bucket if for any f p(x) ∈ S quotient satisfy fq = i condition, i.e., ith bucket is canonical slot for some element in dataset.

280

Middle bit known as is contunuation bit helps the decoder in searching process to identify group of items belonging to same bucket. Least significant bit named as is shifted bit is used to identify where the fr (associated with ith bucket) is stored. The significance of these bits is provided in Table 2.

Jo

fq denotes the index of bucket in which element needs to be inserted or

285

queried. In insertion operation, suitable position, sp, to insert the remainder is at the end of run of bucket denoted by fq . For this, all elements after sp are shifted to right, same operations are repeated till the end of the cluster and then element is inserted and metadata bits are updated. For query operation in QF (where queried element is xq ) f p(xq ) is calculated and then corresponding 23

Journal Pre-proof

290

quotient fq (xq ) and remainder fr (xq ) are computed. Start of cluster contain-

pro of

ing fq is identified and then the start of run corresponding to fq is identified. In querying process, instead of shifting elements only remainder of queried element is checked in concerned run. Deletion process is reverse of the insertion operation. 295

Biggest advantage of QF is that in QF original data, although hashed while storing, can be retrieved back through quotienting hashing technique.

The time required in insertion and deletion process can dominate the advantage of using QF since single cluster is scanned. In each operations Chernoff

re-

bound can be used to limit the size of cluster.

Definition 4: For a QF of m slots, if number of items stored is α × m, then P r[∃(A cluster of length) ≥ k] < m−

(11)

lP

where  is allowable error and α ∈ [0, 1) is a random variable. k, the limit of cluster length (derived from number of slots in QF [55]) is given by : k = (1 + )

300

ln(m) (α − 1)ln(α − 1)

(12)

The length of largest cluster can be controlled by setting value of m high and

urn a

α → 1.

2.2.1. Advantages of QF over BF • In QF all operations are cache friendly, only single cluster is modified in one operation. Since cluster size can be fixed (Eq. 11), cluster fits into

305

cache lines easily. Less data fetch-up operation is required for bulk data stored in secondary memory. In BF, secondary memory fetching time for

Jo

concerned bit for all hash functions increases the complexity of task.

• Since QF supports in-order or linear scan, results are obtained quickly as compared to BF constructed by adding new slices to existing one, thus

310

search complexity of BF is comparatively high.

• Resizing of QF is possible without rearranging all the hashed data which is not possible in BF. 24

Journal Pre-proof

• Merging of two QF into a larger one can be done easily and false positives 315

pro of

do not increase in this operation whereas merging in BF may amplify the error.

• QF performs deletion operation accurately whereas standard BFs does not allow deletion and variants which support deletion may include false negatives.

Variants of QF like cascade filter (CF), buffered quotient filter (BQF), etc. work on similar principle and support working with SSD memory [55].

re-

320

2.2.2. Applications

QF is widely used in network application. Deep packet inspection (DPI) is a platform to monitor the incoming and outgoing traffic on a data centre. Identifying the malicious user from the packets is a time consuming task. Moreover, this matching process consume a lot of memory and CPU resources. Al-Hisnawi

lP

325

et al. used QF to store the malicious users to make searching task fast and efficient as the size of the incoming data increases [57]. Dutta et al. proposed Streaming Quotient Filter (SQF), a quotient filter based streaming model to

330

urn a

count the duplicate entries in the streaming data with predefined fixed memory and fast search facility [58]. QF has been successfully implemented in automatic terms extraction for domain-specific corpora for fast results. It has also been used for warehouse management to locate the items efficiently [59]. Garg et al. have proposed an application of Quotient filter in VANETs. It is utilized for providing an Edge Computing-Based Security Framework used for carrying out Big 335

data Analytics [60]. Quotient Filters have also been used for providing Quality

Jo

of Experience (QoE) in wireless content delivery networks (CDNs). They contribute significantly in improving the accuracy and reducing the effort involved in the caching process [61]. A new approach called Fast Two-dimensional filter with hash table (FTDF-HT) has been proposed for efficient name matching in

340

Named Data Networking (NDN). This approach is also based on the concept of Quotient Filter [62]. 25

Journal Pre-proof

3. Frequency Count

pro of

Given a set of duplicated values, one needs to estimate the frequency of each value. The estimates for relatively rare values can be imprecise; however, frequent values and their absolute frequencies can be determined accurately. When the frequency count problem needs to be solved in sub-linear space, some approximation in the result is tolerable provided the processing is fast. In streaming data, frequent item counting is therefore often posed as ε-approximate frequent item counting, defined below.

Definition 5: (ε-approximate frequent item counting) Given a data stream S = {x_1, x_2, ..., x_n} of n items, F is the set of all x_i ∈ S with frequency greater than a certain threshold, i.e., f_i > (φ − ε) · n, where φ is a user-specified threshold parameter, f_i denotes the frequency of the i-th item, and ε is the allowable error in the results.

Solutions for approximate frequent item counting fall into two categories: counter-based methods and sketches. Counter-based solutions use counters and a probabilistic counting mechanism in sub-linear space with fixed resources such as memory and computational time; the Frequent algorithm, Majority algorithm [63], Lossy Counting [64], Space-Saving [65], etc., fall in this category. Sketches use hashing and approximation-based algorithms to map a large data set into a compact synopsis whose size is much smaller than the size of the dataset; Count-Min Sketch [66], Count Sketch [67], etc., fall in this category. A short survey of prevalent counting algorithms is provided in Table 3.
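As an illustration of the counter-based family, the following is a minimal Python sketch of the Space-Saving idea: keep only k counters, and when a new item arrives while the table is full, replace the item with the smallest count. It is a simplified illustration, not the implementation from [65], and the parameter k is arbitrary.

```python
def space_saving(stream, k):
    """Track approximate frequencies of at most k items from a stream."""
    counts = {}
    for item in stream:
        if item in counts:
            counts[item] += 1                      # known item: just count it
        elif len(counts) < k:
            counts[item] = 1                       # room left: start a counter
        else:
            victim = min(counts, key=counts.get)   # evict the smallest counter
            counts[item] = counts.pop(victim) + 1  # new item inherits its count
    return counts

stream = list("abracadabra")
print(space_saving(stream, k=3))   # heavy items such as 'a' keep large counts
```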

Among all the techniques mentioned in Table III, most robust, with less com-

Jo

putational cost, minimum memory requirement and most adaptive to answer frequency queries is a sketch data structure called Count Min Sketch (CMS). 3.1. Count-Min-Sketch CMS was proposed by Muthukrishnan and Cormode in 2003 and later im-

370

proved in 2005 [66]. It is one of the members in the family of memory efficient

26

Journal Pre-proof

PDS used to optimize the counting of the frequency of an element in lifetime

pro of

of a data set. It is a histogram in which one can store elements and associated counts. As compared to BFs, which represent sets, CMS considers multi-sets, i.e., instead of storing a single bit, CMS maintains a count of all objects. It 375

is called ‘sketch’ because it is a smaller summarization of a larger data set. The probabilistic component of CMS helps in achieving more accurate results in cardinality estimate as compared to counting BF which works with less space and time complexity. The counting BF works with one bloom filter of size m having maximum counter value MAX, while all k hash functions are updating in the same BF, which leads to more collisions and chances of more error in

re-

380

cardinality estimate using counting BF. In CMS combination of d BFs is used, in each row i.e. in each BF only one hash function is allowed to make changes and final decision of cardinality estimate is taken by considering all rows. All

385

lP

these optimizations in CMS help to reduce the deviation in cardinality estimate. Insertion process in CMS is similar to BF. Instead of using 1-D array, CMS uses 2-D array with w columns and d rows. These parameters are used to maintain the trade-off between space and time constraints and accuracy. Since one hash function Hi (.) is associated with each row i, d hash functions are

390

urn a

used for d rows . When an element x arrives, it is hashed to each row, i.e., ∀i(1
Jo

operations about insertion in CMS has been depicted in Fig. 6.

Figure 6: Insertion in Count Min Sketch

27

Journal Pre-proof

For the desired accuracy levels, two parameters  (epsilon) and δ (delta) are

pro of

used to calculate w and d dimensions of a count min sketch.

 (epsilon) is the measure of ‘error added to counts with each item added to 395

the CM sketch’. δ (delta) defines ‘with what probability one allow the count estimate to vary from  error rate’.

Value of w and d are calculated as:

re-

e w=d e 

1 d = dln( )e δ

(13)

(14)

where ln is natural log and 0 e0 is Euler’s constant. 400

To decrease the collision, pairwise independence is used for constructing a

lP

universal hash family.

CMS solves three type of data summarization problems [68]. First is point estimation where frequency of object a[i] in stream is estimated value of number of occurrences of a[i]; calculated by taking the minimum of all the respective counter values in CMS corresponding to that element. The basic insight here

urn a

405

is that there is possibility of collisions between elements, which may increment the counters for multiple items. Taking the minimum count results in a closer

410

approximation. Second is range sum where total frequency count of elements Pk lying in a defined range is returned, i.e., i=j a[i]. Third application of CMS

is to identify heavy hitters: given a stream of data arriving and a constant φ, it can find all items occurring more than φ × N times, i.e., ∀i, find a[i] > φ × N .

Jo

3.2. Count Min Sketch Analysis For an incoming data stream D and an element xi , actual frequency is

denoted by ai and a ˆi is estimated frequency of element by CMS, where , δ ∈

415

(0, 1) are accuracy parameter and confidence parameter respectively for a CMS

of w × d size [69]. 28

Journal Pre-proof

yj , a random variable, gives element count if hashed values for two different

yj =

pro of

objects H(ith ) object and H(j th ) object are equal but i 6= j :   aj , if h(xi ) = h(xj )  0,

(15)

otherwise

Estimated frequency aˆi is sum of actual count of object i (a constant value) 420

and count of object j having hash collisions with object i; the expected value of aˆi is:

X

xj

re-

aˆi = ai +

(16)

j6=i

E[aˆi ] = E[ai +

X

xj ]

(17)

E[xj ]

(18)

j6=i

lP

E[aˆi ] = ai +

X j6=i

Using values of xj in Eq. (18), E[xj ] for w × 1 counters is given by: E[xj ] = aj ∗ P [h(yi ) = h(yi )] + 0 ∗ P [h(yi ) 6= h(yi )]

(19)

urn a

With w × 1 counters, probability that collision will occur in i and j for a hash function is

E[xj ] =

aj w

(20)

Using results of Eq. (20) in Eq. (18), we get:

Jo

425

1 w.

E[aˆi ] ≤ ai +

X aj j6=i

E[aˆi ] ≤ ai +

where ||a||1 is L1 norm of ai ; ||a||1 =

w

||a||1 w Pn

i=1

(21)

(22) |ai |. Higher the value of w,

more will be the accuracy of CMS and more is the memory required for higher accuracy. 29

Journal Pre-proof

Using Markov inequality for c>0

430

pro of

  ||a||1 1 P {aˆi − ai } ≤ c ≤ w c

(23)

For a given value of |0 <  < 1, and w = d e e

P [aˆi > ai + ||a||1 ] ≤

1 e

(24)

CMS uses O( 1 ) spaces, i.e., ≈ O( we ) and estimates frequency with error at most ||a||1 with probability at least (1 − δ), where δ =

1 e.

The above mentioned

analysis is for w × 1 CMS but one may have CMS with multiple rows, i.e.,

re-

w × d. P[Err] is probability of error in w × d CMS, where i is collision in ith row, given by:

P [Err] = P [∃i|i ] = 1 − P [∀i|¯i ]

(25)

lP

Since all estimates are independent of each other

P [Err] ≤ 1 −

1 ed

(26)

urn a

Confidence of getting error probability(δ) equal to (1 − δ), is given by:   1 d = ln δ

(27)

Final conclusion from the above analysis is that if d columns are maintained, the probability that the estimate deviates by at most ||a||i is at least (1 − δ). 435

3.2.1. Applications

Consider a situation where one has stream of data, e.g., updates to stock quotes in a financial processing system is arriving continuously which needs

Jo

to be processed and statistical queries are suppose to be answered in real-time. For efficient handling of such scenarios, one requires to perform fast and efficient

440

processing of streaming data in a single pass. CMS is quite useful in answering frequency query in such problems using small space with constant query time. CMS has been successfully used in graph base semi-supervised learning for large

30

Journal Pre-proof

scale data[70], finding Hot-IP and DDoS attacker in networks algorithms [71].

445

pro of

CMS-Tree, a data structure derived from CMS, has number of applications in Natural Language Processing [72]. Bonelli et al. have presented a counting framework based on probabilistic sketches and LogLog counters for estimating the cardinality of large multi-sets of data [73]. It can be efficiently used for online chain of processing of network devices running at multi-gigabit speeds. Zhu et al. have proposed an approach called Dynamic Count-Min sketch (DCM), 450

which is appropriate for dynamic data set and can provide accurate estimates

Jo

urn a

lP

re-

for point query and self-join size query [74].

31

3.

Motwani [64]

Manku

and

and

Boyer

2.

Moore [63]

Jo

Lossy Counting

rithm

Frequent algo-

Alogrithm

Majority

Based

Counter

Based

Counter

based

Counter

lP O(1)

O(1)

O(1)

exceeds

( k1 )

responding count

• Count is maintained for only those

tion of each bucket.

the extreme sides is done after calcula-

• Random decrement of all counters on

previous bucket CBn−1 is used as base.

• For the new bucket Bn , counter of

old in bucket counters CBi .

Continued on next page

data set and their cor-

elements which cross a defined thresh-

of all unique items in

• It keeps the track

fraction of total counts.

quency

of items whose fre-

• Provide the sequence

votes in n items

• Used to find majority

be applied

Areas where it can

ent elements.

O( 1 )

Query

Complexity

and

Update

pro of

Com-

and calculates the frequency of differ-

• It divides large data into Bi buckets

is not found.

O( 1 )

O(k)

plexity

Space

re-

decrements all the counters when item

Stores values in k counters only and

• It increments counter if item exists.

rithm.

• It is generalization of majority algo-

not selected is decremented by 1.

cremented, else value of index which is

• For exiting item counter value is in-

mented.

item is stored and counter is incre-

• If item is observed first time then

value zero for each item.

• It starts with counter having initial

Category Special Features

urn a

J. S. Moore [75]

1.

Variant

Name of the

S.No. Authors

Table 3: Variants of Counting Algorithm

Journal Pre-proof

4. Count Sketch — Charikar et al. [67]; sketch based.
   Special features: a 2-d matrix similar to the Count-Min sketch is maintained. An extra set of hash functions g_i : [h(x)] → {+1, −1}, i = 1, ..., d, decides which type of operation is performed on the selected index (+1 for an increment and −1 for a decrement).
   Space complexity: O(log(n/δ)/min(ε², 1/k)); update and query complexity: O(log(n/δ)).
   Application: the extra set of hash functions reduces the variance between the actual and expected frequencies of items.

5. Count-Min Sketch — Cormode and Muthukrishnan [66]; sketch based.
   Special features: a w × d matrix is used to store the sketch; a hash function corresponding to each row of the matrix is used to update the value of a new item in all the rows. During a query, the minimum value over all rows is reported as the frequency of the queried element.
   Space complexity: O(log(n/δ)/ε); update and query complexity: O(log(n/δ)).
   Application: answers frequency queries over the Big data stored in the sketch, and guarantees a pre-computed error bound governed by the parameters ε and δ.

6. Space Saving — Metwally et al. [65]; counter based.
   Special features: it stores k items only; the first k distinct items are stored and their corresponding counters are updated. When a new distinct item arrives, the item with the least counter value is replaced by it and that counter is incremented.
   Space complexity: O(1/ε) (the k counters); update and query complexity: O(1).
   Application: keeps a record of items having frequency greater than ε × n using only k counters.

7. Hokusai — Matusevych et al. [76]; sketch based.
   Special features: it is an advanced and compact representation of the Count-Min sketch. When the counters of the CMS reach a defined threshold, the data is preserved in half of the space, with negligible loss of accuracy, by applying a fold operation. The folding operation allows merging of data along time and along items, referred to as time aggregation and item aggregation respectively.
   Space complexity: the total space complexity is the same as CMS, but it accommodates more items in the same space. Update time is O(log(n/δ)); query time depends on the number of folds.
   Application: provides real-time statistics of arbitrary events, e.g., streams of queries, as a function of time.

4. Cardinality Estimate

As the amount of data to be analysed increases, determining cardinality becomes an important task, especially when the incoming data is dynamic and its volume is unknown. In multi-sets, determining the exact cardinality is a highly computation-intensive process, since the cost is proportional to the number of elements in the data set.

Probabilistic cardinality estimators used for determining approximate cardinality include LogLog [77], HyperLogLog [78], MinCount [79], Probabilistic Counting [80], etc.; they are summarised in Table 4. All of these estimators use hash functions to ensure randomization, leading to a significant reduction in memory utilization; the cost paid is that an approximate output is obtained instead of the exact one. These probabilistic estimators follow two approaches. The first is Bit-Pattern Observables (BPO), where certain patterns of bits obtained after hashing are observed and conclusions are drawn from these patterns; Probabilistic Counting, the LogLog counter, the HyperLogLog counter, etc. use the BPO principle for estimating the cardinality of a set. The second approach is statistics based, where Statistical Probability Methods (SPM) are used to find the cardinality; it includes MinCount and the approximate counting algorithm.

One of the most frequently used cardinality estimators for massive Big data is the LogLog counter, a probabilistic counting based algorithm which uses a 16-bit hash function to randomize data and convert it into a uniform binary format [77]. The hashed data set obtained is used for the cardinality estimate. The estimator used in the LogLog counter is the geometric mean of all the registers, and a Bernoulli model is used to provide the final cardinality of the dataset. It estimates the cardinality in a single pass, within defined error limits, using memory much smaller than the size of the dataset. One of the major drawbacks of the LogLog counter is that it is not efficient for Big data containing outliers. HyperLogLog [78], an advanced version of the LogLog counter, uses the principle of stochastic averaging, employs a 32-bit or 64-bit hash function (compared to the 16 bits used by the LogLog counter), and takes the harmonic mean of all the registers to eliminate the effect of outliers.

4.1. HyperLogLog Counter

HyperLogLog (HLL) estimates the cardinality of a large set using a small memory consisting of a fixed number of registers r, each of size m_r, where the register size determines its capacity to store a count; all parameters are a function of the expected approximation. Flajolet et al. proved that the HLL counter can count one billion distinct items with an error of 2% using only 1.5 KB of memory [78] (the proof of HLL accuracy is discussed in Appendix B). In terms of functionality, HLL supports addition of elements and estimation of their cardinality, but it does not support membership checking of specific elements as done in BFs and QFs.

The HLL algorithm is based on the bit-pattern observable principle, i.e., the cardinality of a uniformly distributed multi-set Z of numbers is estimated from the maximum number of leading zeros in the binary representation of each number in Z. If the maximum run of leading zeros observed is of length n − 1, i.e., a prefix of the form 0^{n−1}1, an estimate for the number of distinct elements in Z is 2^n. If a single counter is used for estimation, the variance of the result is quite high. The proposed solution is to run the same experiment m times with different hash functions and then take the average; this reduces the variance and provides a better estimate [81]. HLL uses the principle of stochastic averaging, where the input stream is divided into r sub-streams: if the standard deviation for each sub-stream is σ, then the standard deviation of the averaged value is σ/√r. Also, the harmonic mean is used instead of the arithmetic average to normalize the result and eliminate the effect of outliers. Such means tame probability distributions with slowly decaying right tails, acting as a variance reduction device and leading to better-quality estimates.

The detailed working of HLL is shown in Fig. 7. Streaming data is managed in the input phase to compute the cardinality of the dataset. This data is then provided as input to the hashing phase, where each data instance is hashed into a binary string of l bits. From these l bits, the lower b bits are used to determine the register number to be updated, and the remaining (l − b) bits determine the value to be stored in the register. In the register block, r registers holding the maximum counter values observed so far are maintained, and their values are continuously updated according to the hashed values of the data instances. To provide the cardinality estimate after observing a certain amount of data, an estimator function is used in which the harmonic mean of all register values is taken to reduce the variance of the cardinality estimate.

Figure 7: HyperLogLog framework

The input multi-set Z is divided into r sub-streams z_1, z_2, ..., z_r, where r is the number of registers used to store the values, given by r ← 2^b with b ∈ Z and b > 0, and all registers in the set R are initially set to −∞. Only one hash function is used to convert the domain data into a binary stream for bit-pattern observation, i.e., H : D → {0, 1}^∞. If s ∈ {0, 1}^∞ is a binary stream, then δ(s) is the function which returns the position of the leftmost 1 in s. For every sub-stream z_i, hashing is applied to convert it into a binary string α (α ← H(z_i)). The first b bits of α, i.e., α_{1...b}, determine the register r_i to be updated, and the remaining bits give the register value, i.e., r_i ← δ(α_{b+1,...}). The estimator function E of HLL is based on the harmonic mean:

E = β_r · r² · ( Σ_{j=1}^{r} 2^{−r_j} )^{−1}    (28)

where β_r is a constant, depending on the number of registers, used to correct the systematic multiplicative bias. The algorithm makes adjustments for very small and very large cardinalities by adjusting the value of β_r. Every register r_i ∈ R uses at most log₂ log₂(n) + O(1) bits when cardinalities up to n need to be estimated. The resulting standard error is 1.04/√r, and the accuracy of the estimate is improved by increasing the number of registers in HLL [78].

4.1.1. Applications

HLL finds usage in different application domains such as natural language processing [82], biological data [83], mining of large structured databases [84], network traffic monitoring [85], security-related issues in networks such as detection of worm propagation and of DoS (Denial of Service) attacks [86], and data analytics [87]. A generalization called virtual HyperLogLog (vHLL) finds applications in network traffic measurement and database systems by utilizing compact memory [88]. HLL has also proved useful for computing fast and accurate genomic distances [89].
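The register update and harmonic-mean estimate described above can be summarised in a few lines of Python; the following is a minimal illustrative sketch (not the implementation from [78]): it derives a 64-bit hash from SHA-1, omits the small-range bias correction, and approximates β_r by the asymptotic constant 0.7213/(1 + 1.079/r).

```python
import hashlib

class HyperLogLog:
    def __init__(self, b: int = 10):
        self.b = b                    # lower b bits pick the register
        self.r = 1 << b               # r = 2^b registers
        self.registers = [0] * self.r

    def _hash(self, item) -> int:
        # 64-bit hash value (stand-in for the hash family assumed in the paper)
        return int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")

    def add(self, item) -> None:
        x = self._hash(item)
        j = x & (self.r - 1)                       # register index from lower b bits
        w = x >> self.b                            # remaining 64 - b bits
        rho = (64 - self.b) - w.bit_length() + 1   # position of the leftmost 1
        self.registers[j] = max(self.registers[j], rho)

    def cardinality(self) -> float:
        beta = 0.7213 / (1 + 1.079 / self.r)        # asymptotic bias constant
        harmonic = sum(2.0 ** (-m) for m in self.registers)
        return beta * self.r * self.r / harmonic    # E = beta * r^2 * (sum 2^-M_j)^-1

hll = HyperLogLog()
for i in range(100000):
    hll.add(f"user-{i}")
print(round(hll.cardinality()))   # close to 100000 (typical error ~ 1.04/sqrt(1024) ≈ 3%)
```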

Table 4: Variants of cardinality estimators (SPM: Statistical Probability Methods, BPO: Bit-Pattern Observables)

1. Approximate Counting algorithm — Robert Morris [90]; SPM.
   Special features: developed at Bell Labs on the basis of the observation that only log₂ N bits need to be maintained to count up to N; uses probability-based counters.
   Applications: approximate and fast cardinality estimates in one scan.

2. Adaptive (Wegman) sampling — M. Wegman [91]; sampling based.
   Special features: samples are taken uniformly over the domain of the multi-set and are independent of the distribution of the data.
   Applications: estimating cardinalities of multi-sets.

3. Probabilistic Counting — Flajolet and Martin [80]; BPO.
   Special features: data is hashed into binary format and the first 1 from the leftmost side is taken as the observable; the observables are linked to cardinality and are independent of replication and ordering of the data in files.
   Applications: data mining of Internet graphs.

4. Log-Log counter — Durand and Flajolet [77]; BPO.
   Special features: the observable bit used to set a register value is selected from the right side of the hashed value.
   Applications: estimating cardinalities of massive data sets.

5. Super-LogLog counter — Durand and Flajolet [77]; BPO.
   Special features: the LogLog counter combined with a truncation rule (while selecting register values) and a restriction rule (use register values from a particular interval only).
   Applications: estimating cardinalities of massive data sets.

6. Min/Max Count — Fusy and Giroire [79]; BPO.
   Special features: an extended form of probabilistic counting which uses multiple hash functions instead of one and takes the Min or Max value from the hashed values according to the type of counting.
   Applications: estimating cardinalities of massive data sets.

7. HyperLogLog counter — Flajolet et al. [78]; BPO.
   Special features: a 32-bit hash function is used together with stochastic averaging, and the harmonic mean is introduced to eliminate outliers; handles larger Big data than the LogLog counter, with low latency and a well defined error rate.
   Applications: used by Google, Redis, Amazon, etc. for cardinality estimation [78].

8. HyperLogLog++ counter — Heule et al. [92]; BPO.
   Special features: an improvement over HyperLogLog that uses 64-bit hash functions, a sparse representation of the registers, and a dedicated small-cardinality estimation regime.
   Applications: implemented by Google in many applications [92].

5. Similarity Search

Finding similar items in a set is the process of checking all items and identifying the closest one. To categorize a data set into a particular class, one needs to know how similar two items of the data set are to each other. Problems related to finding similar items are often solved by identifying the nearest neighbours of an object. Such problems have a number of mathematical solutions in terms of distance measures such as Hamming distance, cosine similarity measure, Jaccard's similarity coefficient, Pearson's similarity coefficient, etc. [93].

Searching a huge database for similar items using linear search or a brute-force approach quickly becomes computationally prohibitive. Such solutions are efficient for small data sets, but when massive data sets are considered they face two major problems: first, how to store and represent items for similarity search; and second, how to perform pairwise comparison of billions of items, especially in high-dimensional data sets [94].

A solution to the above problem is either to reduce the dimensionality of the data set or to make structural assumptions about the data, as is done in data structures like trees and hashes. Trees show good results for low-dimensional data, but as the dimensionality increases, query complexity and tree construction cost grow rapidly. In hashes, data is mapped into a hash table using random hash functions, and items mapped close together in the table are assumed to be close neighbours; however, the type of hash function used and hashing collisions can significantly affect the results. Finding the nearest neighbour in Big data with n points and d dimensions using linear search has O(nd) complexity at best.
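For reference, the brute-force baseline that the rest of this section tries to improve on is a linear scan over all n points, touching all d coordinates of each; a minimal NumPy sketch (illustrative only) is shown below.

```python
import numpy as np

def linear_nearest_neighbour(points: np.ndarray, query: np.ndarray) -> int:
    """Exact nearest neighbour by linear scan: O(n * d) per query."""
    # Euclidean distance from the query to every one of the n points.
    distances = np.linalg.norm(points - query, axis=1)
    return int(np.argmin(distances))

rng = np.random.default_rng(0)
data = rng.random((100000, 64))      # n = 100000 points, d = 64 dimensions
q = rng.random(64)
print(linear_nearest_neighbour(data, q))
```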

One well-known solution proposed for finding the nearest neighbour in Big data with n points and d dimensions is the k-d tree. The k-d tree, or k-dimensional tree, proposed by Jon Bentley in 1975, is a binary tree in which every node is a k-dimensional point and each level of the tree corresponds to one dimension. Each level of a k-d tree splits its children along a specific dimension, using a hyperplane perpendicular to the corresponding axis. Insertion and query operations require recursive traversal of the tree, which is time consuming; in the worst case a search is close to scanning the whole tree. Balancing a k-d tree requires extra computational effort to keep it ordered across multiple dimensions, and as the dimensionality grows the structure rapidly loses its pruning power, so its effectiveness degrades exponentially with the number of dimensions.

Further, the nearest neighbour problem with approximation guarantees for a d-dimensional dataset is defined as follows:


Definition 6 (c-approximate NN problem [95]): Let P be a set of points in d dimensions and Q a set of queries. Given a query q ∈ Q, return a point p ∈ P whose distance from q is at most c times the distance of the true nearest neighbour of q in P, for an approximation factor c > 1.

Definition 7 ((R,c)-NN problem [95]): Let P be a set of points in d dimensions with a constant R > 0 and Q a set of queries. For any query q ∈ Q, the following decision related to closeness can be made:

• if ∃ p ∈ P with d(p, q) ≤ R, then return some point p′ ∈ P with d(p′, q) ≤ c·R;

• if d(p, q) > c·R for all p ∈ P, then return no point.

Jo

590

Min-Hash, a probabilistic technique given by Andrei Broder in 1997 [96],

is used to find similarity between two items by computing Jaccard similarity J(A, B) between the items being considered for finding similarity. To find the similarity between members of a set S = s1 , s2 , ...sn Min-Hash

uses set of k hash functions Hk (S) 7→ Z. hmin which stores the minimum value 42

Journal Pre-proof

from the set of hash function hmin ← min(Hk (.)). Two elements s1 and s2 of

pro of

set S are considered similar if hmin (s1 ) = hmin (s2 ) [96]. P r[hmin (s1 ) = hmin (s2 )] ⇒

|s1 ∩ s2 | ≈ J(s1 , s2 ) |s1 ∪ s2 |

(29)

If mH is a random variable, similarity of two items s1 and s2 is given by:

mH =

595

  1, if h(s1 ) = h(s2 )  0, otherwise

(30)

Here mH ∈ (0, 1) is unbiased estimator of similarity. High variance in mH is

re-

reduced by averaging the number of observations.

Number of variants have been proposed to improve the performance of MinHash by maintaining its simplicity. Some important variants are: k-min Hash [97] : As compared to Min -Hash where single hash function is used, k-min Hash uses k hash functions and MIN and MAX are found. For two

lP

600

set A = {a1 , a2 ..., an } and B = {b1 , b2 , ..., bm }, Jaccard similarity between them using k hash functions is given by:

i=(M AX(n,m))

J(A, B) = ∀i=1

M IN (hi=k i=1 (ai , bi )) i=k (a , b )) M AX(hi=1 i i

(31)

urn a

It is more accurate as compared to Min-Hash because it uses k hash functions instead of single hash function.

Min-Hash Sketch (MHS) [98] : The term ‘sketch’ indicates summary of a large set. In Min-hash sketch, k hash functions are calculated for each set and k1 (k1 ⊂ k) hash functions with minimum values are stored in MHS matrix.

To compute similarity on a collection of sets, i.e., A¯ ← (A1 , A2 ...An ), MHS is

Jo

constructed as follows:

605

1−k1 M HS[i, j] ← ∀i=n (hj=k i=1 (M IN j=1 (Ai )))

(32)

This sketch is used to compute the similarity of any pair of documents by

comparing their associated minimum values. Weighted Min Hash [99]: This technique is used for similarity search in

textual data (applications domains such as document retrieval, text mining and 43

Journal Pre-proof

web search etc.) where entire text is divided into fixed size character sets using

pro of

shingling (discussed in coming paragraph) or varying size text called tokens. In weighted Min-hash different weights are assigned to the tokens generated from the text. Frequently used approach for weighting tokens in document retrieval is the Inverse Document Frequency (IDF). For token t, weight is computed as: w(t) = log(1 +

N ) nt

(33)

where nt is the number of times token t appeared in all N documents. Weighting approach maintains equilibrium by putting small weights on frequent tokens

re-

and large weights on rare tokens. This unequal assignment of token weights decreases the value of common tokens and allows more informative tokens to pop out, leading to significant improvement in accuracy of results retrieved. Jaccard similarity for weighted Min-hash is given by:

i=k M IN (hi=1 (ai , bi ))(w(t)) i=k M AX(hi=1 (ai , bi ))(w(t))

lP

JW (A, B) =

(34)

The weighted Jaccard similarity is a natural generalization of Jaccard similarity. It will become simple Jaccard similarity if all weights are set as 1. Variants of Min-hash are used for character level or lexical matching, not for contextual matching.

urn a

610

For finding similar items in massive Big data, Min-hash represents bulk data into compressed form called signature matrix and Locality Sensitive Hashing (LSH) is used to shortlist and narrow down pairwise comparison by identifying the pairs of possible similar items in the dataset. 5.1.1. Steps used in Min-hash

Shingling: A document is a string of characters. The most effective way

Jo

615

to represent documents as sets is convert it into small strings, for the purpose of identifying lexically similarity between documents. k-shingle represents a substring of length k found within the document. In shingling two major issues are faced: first is how to pick size of k and second is which method to usedto convert document into shingles. 44

Journal Pre-proof

Shingle is a contiguous sub-sequence of tokens of length k. (k can vary according

pro of

to application). However, if value for k selected is too small, then it is expected that most sequences of k characters will appear in most of the documents. Thus these type of shingles-sets lead to high Jaccard similarity between unrelated documents. But if value of k selected is very high, then matching of shingles with other documents have very low probability, which again leads to erroneous results. Thus, k should be large enough that probability of any given shingle appearing in any random document is low. Value of k is decided by:

(35)

re-

ck >> l

where c is number of available characters and l denotes the average length of document.

Considering second issues, there are many methods to convert documents into

620

lP

shingles like remove all spaces from the document and then pick string according to shingle size; another approach considers space character as well while calculating to shingles; in another approach first all stop words are removed from the document and then shingling is done; and in hashing based approach, instead of using substrings directly as shingles, we can pick a hash function

625

urn a

that maps strings of length k to some number of buckets and treat the resulting bucket number as the shingle. The set representing a document is then the set of integers that are bucket numbers of one or more k-shingles that appear in the document.

To perform similarity search between items, pre processing of data is required to adjust the data into the compressed form for space saving, uniformity and 630

fast results.

Jo

Characteristic Matrix (CM): Shingles computed for each document are hashed

to compute Jaccard similarity and the matrix set generated from them is called characteristic matrix. For a set of documents S = {D1 , D2 , ..., Dn }, each document having m singles, CM is defined as a binary (m × n) matrix. Rows

635

of CM represent the values of shingles and columns represent the documents. CM [i, j] = 1 in a matrix denotes that ith shingle is present in j th document. 45

Journal Pre-proof

Signature Matrix (SM): (s×n) signature matrix (SM) is derived from (m×n)

pro of

(CM ) having similarity same as that of entire set, where (m >> s). To generate SM, a hash function φ(.) is used which picks a row randomly from SM and 640

then rows are permuted across the columns to generate more random results. Repeating this process m times, a (m × n) signature matrix is generated. This process is followed since it is difficult to store the characteristic matrix and make pairwise comparisons for huge amount of entries. Min-Hashing is used for similarity preserving summarization of sets, i.e., compact representation of

645

large data set in a smaller one with minimum loss of information. SM generated

re-

by Min-Hash act as input for LSH. 5.1.2. Applications:

Initially Min-hash was proposed for document similarity search engine Altavista [100] for grouping similar documents. Later, it was frequently used for similarity search and document duplicate detection especially in web pages

lP

650

[93, 96]. Apart from documents, Min-hash is successfully used in different areas such as-comparing and calculating the distance between genome and meta genomes [101], in domain of image clustering to cluster near duplicates [102],

655

urn a

in clustering of graphs of large data bases like social network [103], in network security Min-hash based sequence classification models are used to detect malwares [104]. In Software Defined Networks (SDN), it is used to build a malicious code classification system [105]. It has also been used in hybridization with HLL to improve the performance of HLL. [78]. Min-hash along with LSH is used in many applications domains to reduce high dimension similarity search to low 660

dimension similarity search [106].

Jo

5.2. Locality-Sensitive Hashing (LSH) The basic principle used in LSH is projection of higher dimensional data in

low dimensions subspace, using the fact that points close in many dimensions remain close in two dimensions too.

665

Let xi ∈
Journal Pre-proof

be a family of hash functions, mapping
pro of

For any two points xi , xj ∈
670

then P1 and P2 are the probabilities that xi and xj will reside in same bucket. The family of H is called locality sensitive or (d1 , d2 , P1 , P2 )− sensitive [107].

Definition 8: A family H of hash functions is said to be (d1 , d2 , P1 , P2 )-sensitive [95] if:

• D||xi , xj || ≤ d1 then P rH [h(xi ) = h(xj )] ≥ P1

re-

675

• D||xi , xj || ≥ d2 then P rH [h(xi ) = h(xj )] ≤ P2

for all cases d1 < d2 and all queries satisfy P1 > P2 , here D||xi , xj || denotes the

lP

distance between two points. If xi and xj are close in
Jo

urn a

680

Figure 8: Probability v/s distance measure in locality sensitive hashing [93]

LSH is used to solve (R,c)-Nearest Neighbour (NN) problem. (R,c)-NN prob-

lem is decision version of c-approximate NN problem. For (R,c)-NN problem in Locality Hash function r1 = R, r2 = cR, where c > 0. 47

re-

pro of

Journal Pre-proof

Figure 9: Locality-Sensitive Hashing framework

Definition 9: (Distance Measures [108])
lP

685

dimensions. Distance measure function on
690

urn a

• D|x, y| = 0 if and only if (x = y)

• D|x, y| = D|y, x|, i.e., distance is symmetric. • D|x, y| ≤ D|x, z| + D|z, y|, triangle inequality or length of shortest path rule.

Number of variants of LSH have been proposed depending upon the universe on which original data is mapped, i.e., on the basis of distance coefficients which satisfies Definition 9 because every distance measure may not have a

Jo

695

corresponding LSH family. Depending on the random function chosen and its locality sensitive properties, LSH is divided into various categories which are discussed in Appendix 2. The key idea of the LSH approximate nearest neighbor (NN) algorithm is to

700

construct a set of hash functions such that the probability of nearby points being 48

Journal Pre-proof

close after transformation with the hash function is larger than the probability

pro of

of two distant points being close after the same transformation. The range space of the function is discretized into buckets and we say that there is a ‘collision’ when two points end up in the same bucket. 705

LSH works by using a carefully selected hash function that causes objects or documents which are similar to have a high probability of colliding in a hash bucket. LSH consists of three phases: pre processing where data is mapped using different distance measures, hash generation where the hash tables are constructed, and similarity search, where the hash tables are used to identify similar items. Entire data is placed in n buckets such that similar items are

re-

710

placed in same bucket. The detailed operations of LSH are illustrated in Fig. 9. 5.2.1. Applications:

Some of the applications which require identification of similar items in-

715

lP

clude similarity in ranking of a product by two users in recommender systems, finding near duplicates corresponding to a particular query document in web documents, identifying similar type of truncations in databases, etc. In recent years, LSH has been used for many applications which require fast computa-

urn a

tional process [109, 110]; in pattern matching which include video identification [111]. Latest area of LSH applications use modified hashing techniques for faster 720

computational process. In mobile services, LSH is used for detecting clones in Android applications [112]. Bertine et al. [113] have used LSH for assembling large genomes with single-molecule sequence in bio-informatics. Naderi et al. have proposed the usage of Locality Sensitive Hashing (LSH) for Malware Signature Generation. It clusters various malicious programs to reduce the number of signatures significantly [114]. With the tremendous increase in sharing of on-

Jo

725

line video data, LSH has also been applied for deduplication of videos by Li et al. [115].

49

Journal Pre-proof

6. Discussion

730

pro of

In today’s world, data is originating from heterogeneous sources and current real world databases are severely susceptible to inconsistent, incomplete and noisy data [93]. In order to support data applications in different domains, data processing must be efficient and automated as much as possible. With an exponential increase in data, extraction of useful information from massive data, particularly for analytics is a daunting task [1]. Some of the applications 735

which need special attention include heavy hitters in data streams, frequency query for all items in the set, estimate the cardinality of massive dataset, find

re-

similar items in huge pool of items, membership query, etc. The main challenge is to store massive data in memory and then index all items for future reference. While dealing with Big data, especially when incoming data is continuous, al740

gorithm need to answer the query in one pass only.

lP

This paper discusses various application areas where probabilistic data structures help in reducing the space and time complexity to a great extent, especially for massive data sets.

Various variants of PDS are discussed and explained. Because of simplicity in design and adaptive nature, BF has been successfully used in a large number

urn a

745

of application domains. Variants discusses in Table I explain the modifications proposed in BF which make it a successful candidate for applications in different domains. It has been observed that recently the focus has shifted to BF which deals with streaming data, i.e., to those belonging to dynamic or ageing BF 750

category.

QF is an another cache friendly PDS used for membership query. QF has

Jo

major advantages over BF in terms of memory and computational time. Use of QF is beneficial when fast insertion and querying is required from the data stored in secondary memory. Further, merging two QFs without any change in

755

accuracy is an added advantage. Count-min sketch is motivated from counting BF concept to reduce error

in observations where number of BFs are used in parallel and minimum val-

50

Journal Pre-proof

ues from the counters of CMS are considered observed as final output. CMS

760

pro of

is most optimal option present in the counting algorithm group for problems like frequency query, heavy hitters, top-k query, etc. While working on skewed distributed data, CMS faces some problems like inefficiency in tracking heavy hitters, space wastage by not using all counters, etc. and such issues need a thorough consideration.

Hyperloglog and hyperloglog++ counters are used to determine cardinality 765

of a huge data set based on bit pattern observable principle and use of stochastic average and harmonic mean. Major advantage of HLL is that very small

re-

memory is used and error rate is significantly low and can be reduced further by using high bit hash functions.

Locality sensitive hashing helps to solve the approximate or exact Near 770

Neighbor Search in high dimensional spaces in sub-linear search time. Initially,

lP

all the data is mapped to low dimensional space and then hash based similarity measures are used to find closest cluster for queried item. There are few issues in LSH which need improvement in LSH. Preprocessing in LSH amplify the error rate leading to increase in the computational overhead many folds. Dy775

namic changes in the data sets are difficult to incorporate in LSH as it leads to

urn a

computational overhead of redoing all preprocessing work. Since hashing used in LSH is independent of the nature of data hashing bias may be observed in some cases.

Table 5 summaries the important features of all the PDS covered in this paper.

Jo

780

51

False Negatives

Element Count

Similarity search

Cardinality Estimate

Time Complexity

Space Complexity

Computational Cost

8.

9.

10.

11.

12.

13.

Retrieval of original data set

5.

7.

Merging

4.

False Positives

Querying

3.

6.

Deletion

52

re-

LOW

LOW

LOW

M EDIU M

M EDIU M

M EDIU M

LOW

M EDIU M

× LOW

?

×

?

×

×

× √

× √

?

× √

?

×











Count Min Sketch

lP √



Quotient Filter

×

?

× √

?

? √



Bloom Filter

Table 5: Comparative analysis of all PDS studied



× √

M EDIU M

LOW

M ED

× √

×

×

HIGH

M EDIU M

HIGH

×

× √

pro of

×

×

×

×

×

× √



Locality Sensitive Hashing

×



Hyper Log Log

? indicates that some variants of PDS may support the particular feature

```

2.

```

Hashing

Parameters↓

1.

S.No.

PDS →

urn a

``` ```

Jo

Journal Pre-proof

Journal Pre-proof

7. Conclusion and Future scope

pro of

This paper provides a comprehensive view of various prevalent PDS which can be used for storage, retrieval and mining of massive data sets. The data structures discussed in the paper can be used to store bulk data in minimum 785

space, find cardinality of data sets, identify similar data sets in Big data and find the frequency of the elements in massive data. All the PDS are supported with their mathematical proofs i.e. mathematical analysis of BF and QF is provided in section 2.1,2.2 respectively, analysis of CMS is provided by section 3.1, section 4.1 and Appendix 1 explains about HLL, and Appendix 2 discusses the details of LSH. Application areas have been discussed at the end of every PDS in the

re-

790

entire manuscript, indicating the domains where they have been successfully implemented. It has been experimentally proved that complexity of PDS is far better than the deterministic ones for various operations such as insert, delete,

795

lP

traversal, search along with other statistical queries. Recent developments in PDS and the aptness of the Smart City Realm for IoT and Big Data Applications have opened a plethora of research opportunities for the industry and academia. In the present era of IoT where sensors, social media, etc. are sending petabytes

urn a

of data per minute, the major challenges are do provide generic platforms for in stream data analytics especially when the volume, variety and velocity of 800

data is not known apriori. Although PDS has been proposed in literature for massive data handling in IoT and Smart Cities but till there are lot of domains which need to be catered which include smart health, smart vehicular network management which includes smart parking, smart environment management etc. Named Data Network (NDN) is another domain where PDS can be used in Forward Interest Table (FIB) Lookup to enhance routing scheme and Pending

Jo

805

Interest Table (PIT) for duplicate detection. PDS can be used in Bioinformatics since storing and pattern matching of k-mers and DNA sequences can be done efficiently through LSH in combination with Bloom Filter. Few researchers are focusing on efficient utilization of PDS in crypto currency and privacy preserving

810

especially in location aware applications and this needs further exploration. The

53

Journal Pre-proof

Table A.6: PDS Implementation Resources PDS

URL

1.

BF

https://github.com/jaybaird/python-bloomfilter

Description

2.

BF

https://github.com/seomoz/pyreBloom

3.

QF

https://github.com/vedantk/quotient-filter

4.

QF

https://github.com/bucaojit/ QuotientFilter

Quotient filter implementation in Java

5.

CMS

https://github.com/rafacarrascosa/ countminsketch

CountMinSketch is a minimalistic Count-min Sketch

6.

HLL

https://github.com/prasanthj/hyperloglog

7.

LSH

https://github.com/ekzhu/datasketch

8.

LSH

https://github.com/simonemainardi/LSHash

9.

LSH

https://github.com/go2starr/lshhdc

pro of

S.No.

pybloom includes Scalable Bloom Filter’s implementation

pyreBloom provides Redis backed Bloom Filter using GETBIT and SETBIT

Quotient filter in-memory implementation written in C

in pure Python

API support for specifying hashcode directly to compute cardinality estimate

Datasketch gives you probabilistic data structures that can process vary large amount of data LSHash is a fast Python implementation of locality

re-

sensitive hashing with persistence support

LSHHDC: Locality-Sensitive Hashing based High Dimensional Clustering

lP

variants of the PDS along with the applications can serve as a initial benchmark for readers who want to pursue their research in this area. Considering the exponential increase in data and the application domains it can be concluded that PDS can be used for large scale applications in various engineering fields.

APPENDIX

urn a

815

Appendix A. Implementation Resources In table A.6, some useful resources are described related to each PDS we have discussed so far.

Appendix B. (Hyper Log Log) Theorem 1: Let the algorithm HYPERLOGLOG be applied to an ideal multi-

Jo 820

set of (unknown) cardinality n, using m ≥ 3 registers, and let E be the resulting cardinality estimate. Proof: Here is the intuition underlying the algorithm. Let n be the unknown n cardinality of M. Each substream will comprise approximately ( m ) elements.

54

Journal Pre-proof

825

n Then, its Max-parameter should be close to log2 ( m ). The harmonic mean of n m.

An ideal multiset

pro of

the quantities 2M ax is then likely to be of the order of

of cardinality n is a sequence obtained by arbitrary replications and permutations applied to n uniform identically distributed random variables over the real interval [0:1]. 830

Note that the number of distinct elements of such an ideal multiset equals n with probability 1. Henceforth let Eˆn and Vˆn be the expectation and variance operators under this model.

re-

• The estimate E is asymptotically almost unbiased in the sense that 1 ˆ En (E)n→∞ = 1 + δ1 (n) + O(1) n where |δ1 (n)| < 5 × 10−5 as soon as m ≥ 16

835

q

Vˆn (E)

βm = √ + δ2 (n) + O(1) m where |δ2 (n)| < 5 × 10−4

n→∞

urn a

1 n

1 n

(B.2) (B.3)

q Vˆn (E), where n → ∞

lP

• The standard error defined as

(B.1)

as soon as m ≥ 16

(B.4) (B.5) (B.6)

the constants βm being bounded, with β16 = 1.106, β32 = 1.070, β64 = 1.054,

terms the typical error to be observed (in a mean quadratic sense). The func-

Jo

840

β128 = 1.046, and p β∞ = (3log2) − 1 = 1.03896. The standard error measures in relative

tions δ1 (n); δ2 (n) represent oscillating functions of a tiny amplitude, which

are computable, and whose effect could in theory be at least partly compensated—they can anyhow be safely neglected for all practical purposes. From Theorem 1 main conclusions to the effect that the relative accuracy of

845

hyperloglog is numerically close to

β∞ m .

55

The algorithm needs to maintain a

Journal Pre-proof

collection of registers, each of which is at most log2 log2 (N ) + O(1) bits, when

pro of

cardinalities ≤ N need to be estimated. As a consequence, using m = 2048,

hashing on 32 bits, cardinalities till values over N = 109 can be estimated with a typical accuracy of 2% using 1.5kB of storage.

850

Appendix C. (LSH Families)

Appendix C.1. LSH with Hamming distance

LSH on binary string was proposed by Indyk and Motwani [116], where a data set
re-

p, q ∈
lP

represented as binary string or U nary(p) function is used for replacing each coordinate of pi with equivalent binary string of γ bits. Hash functions for hamming space are constructed by selecting k bits randomly from the binary string b(.) of an element. ` hash functions are calculated,

urn a

given by:

∀`i=1 Hi ← Randomk (b(.))

860

Each hash function returns k random bits from the binary string b of an element of γ bits. For two items (p and q)∈
Jo

P r[hi (p) = hi (q)] = γ − ||p − q||h1 = (1 −

865

P r[hi (p) 6= hi (q)] = γ − ||p − q||h2 = (1 −

||p−q||h1 γ ||p−q||h2 γ

) )

This problem can be converted into (r1 , r2 , P1 , P2 ) problem. LSH with hamming   ||p−q||h1 ||p−q||h2 distance is ||p − q||h1 , ||p − q||h2 , (1 − ), (1 − ) − sensitive. To γ γ

find (R,c)-nearest neighbor to the query element z, all hash functions are com-

puted for z, i.e.,

` i=1 hi (z)

←`i=1 Hi (z). Instead of comparing each binary string 56

Journal Pre-proof

870

with other, only hash values are compared and element having most similar

pro of

hash values are considered as nearest neighbor. Appendix C.2. LSH with Jacard Similarity

LSH is jaccard similarity [94] is computed with the help of Min-hash. The Signature Matrix(SM) act as a input. If m1 and m2 are the Min-hash based 875

distance between two rows of signature matrix then probability that both rows will appear in same band after applying banding technique will be (1 − m1 ) and (1 − m2 ) corresponding. After banding techniques similar items are grouped in same buckets with very high probability. According to definition of LSH family,

880

re-

this is a (m1 , m2 , (1 − m1 ), (1 − m2 ))− sensitive LSH family. Appendix C.2.1. LSH with Euclidean Distance(Edx,y )

LSH on points distributed in d-dimensional
lP

el. [117]. Let ai ∈
Edai ,aj

v u d uX a  xal i − xl j =t l=1

urn a

For applying LSH on given points in ls space, first hash function (hi ) is generated by a random line (Li ) in a given plane. Li is divided into buckets of equal size (s) and orthogonal projection from the given points to the line is drawn. If points are close enough they lie in same bucket on the line Li . Let 885

ai , aj ∈
s 2

then there exists a probability that two points are in the

Jo

same bucket, i.e., P1 =

1 2

and if Edai ,aj ≥ 2s then probability that both points

resides in same bucket is dependent on angle between them, if θ lies between

890

60 < θ < 90, probability P2 = 31 . Hence, the family of points in d-dimensional space
orthogonal projection on a random line with intervals of size s is a ( 2s , 2s, 12 , 13 )sensitive family of hash functions. 57

Journal Pre-proof

Appendix C.3. LSH with Cosine Distance(θ)

pro of

LSH on vectors in d-dimensional
as a distance measure [118]. Let vi ∈
vi .vj |vi ||vj |

Let vi , vj ∈
re-

a hyperplane which is selected randomly so that angle between rvg and vi , vj varies, so dot product DPi = rvg .vi and DPj rvg .vj varies. If hyperplane is ran900

dom such that rvg lies between vi and vj then DPi and DPj have same signs

lP

other wise they have different signs.

P r(Sign(rvg .vi ) = Sign(rvg .vj )) =

θ 180

With cosine distance θ1 and θ2 , the pair of vectors in d-dimensional space vi , vj ∈
θ1 180 ), (1



θ2 180 )-

905

urn a

sensitive family of hash functions.

Here only four LSH families, which are normally found in the literature, have been discussed but they can always be further extended.

References

[1] S. Garc´ıa, S. Ram´ırez-Gallego, J. Luengo, J. M. Ben´ıtez, and F. Herrera, “Big data preprocessing: methods and prospects,” Big Data Analytics, vol. 1, no. 1, p. 9, 2016.

Jo 910

[2] L. Rutkowski, M. Jaworski, and P. Duda, “Basic concepts of data stream mining,” in Stream Data Mining: Algorithms and Their Probabilistic Properties.

Springer, 2020, pp. 13–33.

58

Journal Pre-proof

[3] C. Srinivasan, B. Rajesh, P. Saikalyan, K. Premsagar, and E. S. Yadav, “A review on the different types of internet of things (iot),” Journal of

pro of

915

Advanced Research in Dynamical and Control Systems, vol. 11, no. 1, pp. 154–158, 2019.

[4] M. P. Singh, M. A. Hoque, and S. Tarkoma, “Analysis of systems to process massive data stream,” CoRR, 2016. 920

[5] J. Bi and C. Zhang, “An empirical comparison on state-of-the-art multiclass imbalance learning algorithms and a new diversified ensemble learn-

re-

ing scheme,” Knowledge-Based Systems, vol. 158, pp. 81–93, 2018.

[6] W. Gan, J. C.-W. Lin, H.-C. Chao, H. Fujita, and P. S. Yu, “Correlated utility-based pattern mining,” Information Sciences, vol. 504, pp. 470 925

– 486, 2019. [Online]. Available: http://www.sciencedirect.com/science/

lP

article/pii/S0020025519306139

[7] A. Gakhov, Probabilistic Data Structures and Algorithms for Big Data Applications. [8] I. Katsov,

“Probabilistic data structures for web analytics and

data

mining,”

line].

Available:

2012,

urn a

930

BoD–Books on Demand, 2019.

[Accessed

Online:

May

2016].

[On-

https://highlyscalable.wordpress.com/2012/05/01/

probabilistic-structures-web-analytics-data-mining/

[9] B. H. Bloom, “Space/time trade-offs in hash coding with allowable errors,” Commun. ACM, vol. 13, no. 7, pp. 422–426, 1970.

935

[10] S. Tarkoma, C. E. Rothenberg, and E. Lagerspetz, “Theory and practice

Jo

of bloom filters for distributed systems,” IEEE Communications Surveys Tutorials, vol. 14, no. 1, pp. 131–155, 2012.

[11] A. Kirsch and M. Mitzenmacher, “Distance-sensitive bloom filters.” in Proceedings of the Meeting on Algorithm Engineering & Expermiments,

940

vol. 6.

Philadelphia, PA, USA: SIAM, 2006, pp. 41–50.

59

Journal Pre-proof

[12] J. Bruck, J. Gao, and A. Jiang, “Weighted bloom filter,” in IEEE InterIEEE, 2006.

pro of

national Symposium on Information Theory.

[13] L. Fan, P. Cao, J. Almeida, and A. Z. Broder, “Summary cache: A scalable wide-area web cache sharing protocol,” IEEE/ACM Trans. Netw., vol. 8, 945

no. 3, pp. 281–293, Jun. 2000.

[14] F. Bonomi, M. Mitzenmacher, R. Panigrahy, S. Singh, and G. Varghese, “An improved construction for counting bloom filters,” in Proceedings of the 14th Conference on Annual European Symposium, ser. ESA’06, vol. 14.

950

re-

London, UK: Springer-Verlag, 2006, pp. 684–695.

[15] D. Guo, J. Wu, H. Chen, Y. Yuan, and X. Luo, “The dynamic bloom filters,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 1, pp. 120–133, 2010.

lP

[16] P. S. Almeida, C. Baquero, N. Pregui¸ca, and D. Hutchison, “Scalable bloom filters,” Inf. Process. Lett., vol. 101, no. 6, pp. 255–261, Mar. 2007. 955

[17] F. Deng and D. Rafiei, “Approximately detecting duplicates for streaming data using stable bloom filters,” in Proceedings of the ACM SIGMOD

urn a

International Conference on Management of Data, ser. SIGMOD’06. New York, USA: ACM, 2006, pp. 25–36.

[18] A. Kirsch and M. Mitzenmacher, “Less hashing, same performance: Build960

ing a better bloom filter,” Random Struct. Algorithms, vol. 33, no. 2, pp.

187–218, Sep. 2008.

[19] S. Geravand and M. Ahmadi, “Bloom filter applications in network secu-

Jo

rity: A state-of-the-art survey,” Computer Networks, vol. 57, no. 18, pp. 4047–4064, 2013.

965

[20] K. W. Choi, D. T. Wiriaatmadja, and E. Hossain, “Discovering mobile applications in cellular device-to-device communications: Hash function and bloom filter-based approach,” IEEE Transactions on Mobile Computing, vol. 15, no. 2, pp. 336–349, 2016. 60

Journal Pre-proof

[21] K. Verma and H. Hasbullah, “Bloom-filter based ip-chock detection scheme for denial of service attacks in vanet,” Security and Communi-

pro of

970

cation Networks, vol. 8, no. 5, pp. 864–878, 2015.

[22] W. Song, B. Wang, Q. Wang, Z. Peng, W. Lou, and Y. Cui, “A privacypreserved full-text retrieval algorithm over encrypted data for cloud storage applications,” Journal of Parallel and Distributed Computing, vol. 99, 975

pp. 14 – 27, 2017.

[23] B. Groza and P.-S. Murvay, “Efficient intrusion detection with bloom fil-

re-

tering in controller area networks,” IEEE Transactions on Information Forensics and Security, vol. 14, no. 4, pp. 1037–1051, 2019. [24] K. Cheng, “Hot spot tracking by time-decaying bloom filters and reser980

voir sampling,” in International Conference on Advanced Information NetSpringer, 2019, pp. 1147–1156.

lP

working and Applications.

[25] M. Najam, R. U. Rasool, H. F. Ahmad, U. Ashraf, and A. W. Malik, “Pattern matching for dna sequencing data using multiple bloom filters,” BioMed Research International, vol. 2019, 2019. [26] Quora, “What are the best applications of bloom filters?”

urn a

985

2014,

[Accessed Online: Feb 2017]. [Online]. Available: https://www.quora. com/What-are-the-best-applications-of-Bloom-filters

[27] A. Singh, S. Garg, K. Kaur, S. Batra, N. Kumar, and K.-K. R. Choo, “Fuzzy-folded bloom filter-as-a-service for big data storage on cloud,” IEEE Transactions on Industrial Informatics, 2018.

[28] P. Liu, H. Wang, S. Gao, T. Yang, L. Zou, L. Uden, and X. Li, “Id

Jo

990

bloom filter: Achieving faster multi-set membership query in network applications,” in 2018 IEEE International Conference on Communications (ICC).

IEEE, 2018, pp. 1–6.

61

Journal Pre-proof

995

[29] J. Lu, Y. Wan, Y. Li, C. Zhang, H. Dai, Y. Wang, G. Zhang, and B. Liu,

pro of

“Ultra-fast bloom filters using simd techniques,” IEEE Transactions on Parallel and Distributed Systems, vol. 30, no. 4, pp. 953–964, 2019.

[30] R. Patgiri, S. Nayak, and S. K. Borgohain, “rdbf: A r-dimensional bloom filter for massive scale membership query,” Journal of Network and Com1000

puter Applications, 2019.

[31] Z. Sun, S. Gao, B. Liu, Y. Wang, T. Yang, and B. Cui, “Magic cube bloom filter: Answering membership queries for multiple sets,” in 2019 IEEE

IEEE, 2019, pp. 1–8. 1005

re-

International Conference on Big Data and Smart Computing (BigComp).

[32] M. Mitzenmacher, “Compressed bloom filters,” IEEE/ACM Transactions

lP

on Networking, vol. 10, no. 5, pp. 604–612, 2002. [33] S. Cohen and Y. Matias, “Spectral bloom filters,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, ser. SIGMOD’03. 1010

New York, USA: ACM, 2003, pp. 241–252.

[34] A. Kumar, J. J. Xu, L. Li, and J. Wang, “Space-code bloom filter for

urn a

efficient traffic flow measurement,” in Proceedings of the 3rd ACM SIGCOMM Conference on Internet Measurement, ser. IMC’03.

New York,

USA: ACM, 2003, pp. 167–172.

[35] E.-J. Goh, “Secure indexes.” IACR Cryptology ePrint Archive, vol. 2003, 1015

pp. 2–16, 2003.

[36] K. Shanmugasundaram, H. Br¨ onnimann, and N. Memon, “Payload attri-

Jo

bution via hierarchical bloom filters,” in Proceedings of the 11th ACM Conference on Computer and Communications Security, ser. CCS’04.

New York, USA: ACM, 2004, pp. 31–41.

1020

[37] B. Chazelle, J. Kilian, R. Rubinfeld, and A. Tal, “The bloomier filter: An efficient data structure for static support lookup tables,” in Proceedings

62

Journal Pre-proof

of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms,

pro of

ser. SODA’04. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2004, pp. 30–39. 1025

[38] M.-Z. Xiao, Y.-F. Dai, and X.-M. Li, “Split bloom filter,” Tien Tzu Hsueh Pao/Acta Electronica Sinica, vol. 32, pp. 241–245, 2004.

[39] F. Chang, W. chang Feng, and K. Li, “Approximate caches for packet classification,” in Twenty-third AnnualJoint Conference of the IEEE Computer and Communications Societies (INFOCOM’04), vol. 4, March 2004, pp. 2196–2207.

re-

1030

[40] Y. Lu, B. Prabhakar, and F. Bonomi, “Bloom filters: Design innovations and novel applications,” in In Proc. of the Forty-Third Annual Allerton

lP

Conference, 2005.

[41] B. Donnet, B. Baynat, and T. Friedman, “Retouched bloom filters: Al1035

lowing networked applications to trade off selected false positives against false negatives,” in Proceedings of the ACM CoNEXT Conference, ser. CoNEXT’06.

J. Gao,

and A. A. Jiang,

urn a

[42] J. Bruck,

New York, USA: ACM, 2006, pp. 13:1–13:12. “Adaptive bloom filter,”

California Institute of Technology, 2006. [Online]. Available:

1040

http:

//authors.library.caltech.edu/26103/1/etr072.pdf

[43] M. Zhong, P. Lu, K. Shen, and J. Seiferas, “Optimizing data popularity conscious bloom filters,” in Proceedings of the twenty-seventh ACM symposium on Principles of distributed computing, ser. PODC’08. New York,

Jo

NY, USA: ACM, 2008, pp. 355–364.

1045

[44] M. Ahmadi and S. Wong, “A memory-optimized bloom filter using an additional hashing function,” in IEEE Global Telecommunications Con-

ference (GLOBECOM’08), Nov 2008, pp. 1–5.

63

Journal Pre-proof

[45] A. Goel and P. Gupta, “Small subset queries and bloom filters using

1050

pro of

ternary associative memories, with applications,” SIGMETRICS Perform. Eval. Rev., vol. 38, no. 1, pp. 143–154, Jun. 2010.

[46] C. E. Rothenberg, C. A. B. Macapuna, F. L. Verdi, and M. F. Magalhaes, “The deletable bloom filter: a new member of the bloom family,” IEEE Communications Letters, vol. 14, no. 6, pp. 557–559, June 2010.

[47] R. P. Laufer, P. B. Velloso, and O. C. M. B. Duarte, “A generalized bloom 1055

filter to secure distributed network applications,” Comput. Netw., vol. 55,

re-

no. 8, pp. 1804–1819, Jun. 2011.

[48] J. L. Dautrich, Jr. and C. V. Ravishankar, “Inferential time-decaying bloom filters,” in Proceedings of the 16th International Conference on Extending Database Technology, ser. EDBT’13. 2013, pp. 239–250.

lP

1060

New York, USA: ACM,

[49] F. Concas, P. Xu, M. A. Hoque, J. Lu, and S. Tarkoma, “Multiple set matching and pre-filtering with bloom multifilters.” [50] M. Mitzenmacher, “A model for learned bloom filters and related struc-

1065

urn a

tures,” arXiv preprint arXiv:1802.00884, 2018. [51] A. Singh and S. Batra, “Streamed data analysis using adaptable bloom filter,” Computing and Informatics, vol. 37, no. 3, pp. 693–716, 2018.

[52] Y. Hua, B. Xiao, B. Veeravalli, and D. Feng, “Locality-sensitive bloom filter for approximate membership query,” IEEE Transactions on Com-

puters, vol. 61, no. 6, pp. 817–830, 2012.

[53] S. Negi, A. Dubey, A. Bagchi, M. Yadav, N. Yadav, and J. Raj, “Dynamic

Jo

1070

partition bloom filters: A bounded false positive solution for dynamic set membership,” arXiv preprint arXiv:1901.06493, 2019.
