Information Processing Letters 114 (2014) 185–191
Improving the performance of Invertible Bloom Lookup Tables

Salvatore Pontarelli (University of Rome "Tor Vergata", Via del Politecnico 1, 00133 Rome, Italy), Pedro Reviriego (Universidad Antonio de Nebrija, C/ Pirineos 55, E-28040 Madrid, Spain), Michael Mitzenmacher (Harvard University, 33 Oxford Street, Cambridge, MA 02138, USA)
Article history: Received 22 July 2013; received in revised form 11 November 2013; accepted 17 November 2013; available online 12 December 2013. Communicated by M. Chrobak.

Keywords: Algorithms; Bloom filters; Hash; Data structures

Abstract. Invertible Bloom Lookup Tables (IBLTs) have recently been introduced as an extension of traditional Bloom filters. IBLTs store key-value pairs. Unlike traditional Bloom filters, IBLTs support both a lookup operation (given a key, return a value) and an operation that lists out all the key-value pairs stored. One issue with IBLTs is that there is a probability that a lookup operation will return "not found" for a key. In this paper, a technique to reduce this probability without affecting the storage requirement and only moderately increasing the search time is presented and evaluated. The results show that it can significantly reduce the probability of not returning a value that is actually stored in the IBLT. The overhead of the modified search procedure, compared to the standard IBLT search procedure, is small and has little impact on the average search time.

© 2013 Elsevier B.V. All rights reserved.
Corresponding author: Salvatore Pontarelli. E-mail addresses: [email protected] (S. Pontarelli), [email protected] (P. Reviriego), [email protected] (M. Mitzenmacher). Mitzenmacher's work was supported in part by NSF grants CNS-1228598, IIS-0964473, and CCF-0915922. doi:10.1016/j.ipl.2013.11.015

1. Introduction

Bloom Filters (BFs), originally developed by Burton Bloom [1], are a simple data structure providing a representation of a set of elements that has found numerous applications in computing and networking. They are used to efficiently check set membership when a small probability of false positives is allowed. (See [2] for an introduction and several applications.) Recently, Invertible Bloom Lookup Tables (IBLTs), an extension of BFs that store key-value pairs and allow the recovery of the original data set, have been introduced [5]. (Invertible Bloom Lookup Tables themselves generalize the Invertible Bloom Filters of Eppstein and Goodrich [3].) Potential applications of IBLTs include database reconciliation, traffic monitoring, and error correction in large data sets [3,4,6]. IBLTs will generally return the value associated with a key, and a null value if queried for a key not in the set of keys stored, but it
can with some probability also return a “not found” value for keys in the set and keys not in the set. We refer to returning “not found” as a failure. The failure probability can be reduced at the expense of increasing the storage requirements of the IBLT. Indeed, with the original IBLT approach, the space for answering lookup queries successfully is a clear bottleneck; the space required to reduce the “not found” probability to suitable practical values is much higher than the space to ensure the original data set can be recovered. In this paper we describe a method that substantially reduces the probability of “not found” return values for the lookup operation for IBLTs. The proposed method does not require additional structure to be added to the IBLT data structure. It merely reduces the probability of failure, improving the set of available tradeoffs between space and the success probability of lookups. The cost is slightly longer times for a lookup operation, as more cells may need to be examined than for the previous lookup approach. The rest of the paper is organized as follows. Section 2 reviews the IBLT data structure. In Section 3 the improved get procedure is presented and its performance is analyzed theoretically. In Section 4, the proposed modification is evaluated with simulations to show its effectiveness.
2. IBLT review

We review the IBLT data structure, following [5]. An IBLT is a table with m cells that is used to store a set of key-value pairs (x_i, y_i). For each cell, the following fields are defined:
• A keySum field that stores the sum of the keys x_i that have been mapped to the cell.
• A valueSum field that stores the sum of the values y_i that have been mapped to the cell.
• A count field that stores the number of entries that have been mapped to the cell.

Note that sum can generally be taken to be XOR if keys can be represented as fixed-length bitstrings; see [5] for more discussion on this point. The following operations are defined over an IBLT:

• insert(x, y) to add a key-value pair (x, y) to the IBLT.
• delete(x, y) to remove a key-value pair (x, y) from the IBLT.
• get(x) to return the value associated with a key from the IBLT. Return null if there is no value y associated with the key x. The operation may fail with constant probability, returning "not found", in which case there may or may not be a value associated with the key x.
• listEntries to retrieve and list all the key-value pairs stored in the IBLT. This operation may fail with probability inversely polynomial in the number of stored pairs.

The standard IBLT construction uses k subtables and a set of k hash functions h_1, h_2, ..., h_k that map keys to values between 0 and m/k − 1. For example, to add a pair (x, y), we compute the values h_1(x), h_2(x), ..., h_k(x), and we update the cells with positions h_1(x), m/k + h_2(x), ..., (k − 1)m/k + h_k(x) in the table by increasing the count field and adding x and y respectively to the fields keySum and valueSum. To remove a pair (x, y) that we assume is stored in the IBLT, we similarly access these cells, and decrement the count field and subtract x and y respectively from the fields keySum and valueSum.

The get operation also accesses the cells with positions h_1(x), m/k + h_2(x), ..., (k − 1)m/k + h_k(x). If at least one of the cells has count = 0 then the IBLT responds with a null value. Otherwise, the procedure tries to find a cell with a count of one. When one of these cells has count = 1 and keySum = x, the value y can be retrieved, as it corresponds to the valueSum field. When count = 1 and keySum ≠ x, the IBLT responds with a null value, as the key x cannot then be in the IBLT. If none of those cells has a count of one, the get operation responds by failing and returning "not found". This will occur with low probability when the IBLT is properly dimensioned.

The listEntries operation that recovers all the entries stored in the IBLT is more complex, and not relevant to our work here.
The main point is again that when the IBLT is properly dimensioned, the entries will be successfully listed with a probability close to one. See [5] for more details.
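The cell fields and operations just reviewed can be made concrete with a short sketch. This is an illustrative implementation rather than code from [5]: XOR serves as the "sum", keys and values are assumed to be non-negative integers, and the SHA-256-derived hash functions are a placeholder choice.

```python
import hashlib

class IBLT:
    """Sketch of an Invertible Bloom Lookup Table with k subtables of m/k cells.

    Keys and values are non-negative integers; XOR plays the role of "sum".
    """
    NOT_FOUND = object()  # sentinel distinguishing a failed lookup from null

    def __init__(self, m, k):
        assert m % k == 0
        self.m, self.k = m, k
        self.count = [0] * m
        self.key_sum = [0] * m
        self.value_sum = [0] * m

    def _positions(self, x):
        # Cell (i - 1) * m/k + h_i(x): one position in each of the k subtables.
        c = self.m // self.k
        return [i * c + int.from_bytes(
            hashlib.sha256(f"{i}:{x}".encode()).digest()[:8], "big") % c
            for i in range(self.k)]

    def insert(self, x, y):
        for p in self._positions(x):
            self.count[p] += 1
            self.key_sum[p] ^= x
            self.value_sum[p] ^= y

    def delete(self, x, y):
        for p in self._positions(x):
            self.count[p] -= 1
            self.key_sum[p] ^= x
            self.value_sum[p] ^= y

    def get(self, x):
        """Standard get: None means null (x surely absent); NOT_FOUND is a failure."""
        pos = self._positions(x)
        if any(self.count[p] == 0 for p in pos):
            return None
        for p in pos:
            if self.count[p] == 1:
                return self.value_sum[p] if self.key_sum[p] == x else None
        return IBLT.NOT_FOUND
```

Because the k positions lie in disjoint subtables, a single inserted pair occupies k distinct cells, each with count 1, and a subsequent get for that key succeeds.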
In some applications of IBLTs, such as database reconciliation or error correction, there may be errors in the contents of the IBLT, deletions of key-value pairs not in the system, or multiple values inserted for a key. In such cases, two additional fields are sometimes added to the cells: hashkeySum and hashvalueSum, which store sums of hashes of the keys and values summed in the keySum and valueSum fields. In such cases a count field may not be needed; the hashkeySum and hashvalueSum fields can be used to verify whether a single key is contained in a cell. We focus our attention on the version of the IBLT without these fields, although our technique (with additional care) can also be applied in this setting.

3. Improved get procedure

3.1. Analysis for stored keys

The get operation for a key x that is in the IBLT fails if all the positions h_1(x), m/k + h_2(x), ..., (k − 1)m/k + h_k(x) have a count larger than one. This occurs with a probability that is similar to that of a false positive in a traditional Bloom filter. In what follows assume that we store n key-value pairs in an IBLT with m cells. If p_i is the fraction of cells that have a count of i excluding the key x, then
P_fail = (1 − p_0)^k.   (1)
Following a standard practice in analyzing hashing data structures, we will actually use p_i to denote the asymptotic expected fraction of cells with a count of i henceforth. This is a suitable approximation, as the fraction of cells with count i is concentrated closely around p_i for constant i and typical values of m and n (say in the thousands), which is all we shall be considering here. See [2] for further details. Similarly, not including x just changes the number of keys by 1, which asymptotically has no effect. For convenience, we will use equal signs in some equations, but the reader should understand that these are approximations to actual performance, ignoring o(1) variations and the fact that the results hold with high probability.

To reduce the probability of failure, the idea is to exploit positions that have a count of two to obtain a successful get. When a cell that x has hashed to has count = 2, the keySum field stores the sum of two keys, x + x′. Therefore x′ can be obtained by simply computing x′ = keySum − x. Then the positions h_1(x′), m/k + h_2(x′), ..., (k − 1)m/k + h_k(x′) are checked to see if one of them has a count of one. If that is the case, the value y′ associated with key x′ is recovered. Subsequently the value y associated with x is easily recovered by computing y = valueSum − y′. This improved get procedure can be summarized as follows:

1. Perform a traditional get.
2. When step 1 fails, obtain the positions that x hashes to with a count of two. If there are no such positions, then we return "not found" for x.
3. For each position h_i(x) + (i − 1)m/k with a count of 2 found in step 2, obtain the other key stored there by computing x′ = keySum − x.
4. Check that h_i(x′) = h_i(x) for each pair of x′ and i values found in step 3. If they are ever not equal, return null for x. Otherwise, continue.
5. For each key x′ computed in step 3 perform a get. If any such key returns with a null value, return null for x. If all such keys fail, return "not found" for x. Otherwise, at least one get is successful, and we obtain a value y′ associated with the key x′, and we continue.
6. Obtain the value associated with x by finding y = valueSum − y′.

We first consider the case where x is stored in the IBLT. In this case, any x′ found with x in a cell of count two is also stored in the IBLT, and hence the return value will either be "not found" (a failure) or the correct value. This improved get fails when one of the following situations occurs:

1. All the positions tested in the first get have a count field greater than two.
2. All the get operations for the other keys tested also fail.

The first situation occurs with probability:
P_fail1 = (1 − p_0 − p_1)^k.   (2)
The second depends on the number of positions with a count of two and can be approximated as:

P_fail2 = Σ_{i=1}^{k} C(k, i) (p_1 (1 − p_0)^{k−1})^i (1 − p_0 − p_1)^{k−i}.   (3)

Here i is the number of positions in step 3 with count = 2, p_1 is the probability that a given position where x has hashed to has one other key x′ hashed to that position (giving a count of two), and (1 − p_0)^{k−1} is the probability that the k − 1 other positions associated with x′ all have another key. The (1 − p_0 − p_1)^{k−i} term accounts for the k − i positions with a count of three or greater. Combining, we find
P_fail = P_fail1 + P_fail2 = [(1 − p_0 − p_1) + p_1 (1 − p_0)^{k−1}]^k.   (4)

This can be much smaller than the failure probability for a key in the filter using the standard get operation, as will be seen in the results presented in Section 4. Note that p_i, the probability a cell has i keys, is equal to the probability that i out of n keys hash to a cell when each has probability k/m of hashing there. It is well known that this converges to a discrete Poisson distribution, and this approximation is highly accurate even for reasonably sized IBLTs. We use this to estimate the failure probability analytically. For our purposes, only the values of p_0 and p_1 are needed, and they are well approximated as:

p_0 ≈ e^{−nk/m},   p_1 ≈ (nk/m) e^{−nk/m}.

Hence we have asymptotically and as a good approximation:

P_fail ≈ [1 − e^{−nk/m} − (nk/m) e^{−nk/m} + (nk/m) e^{−nk/m} (1 − e^{−nk/m})^{k−1}]^k.   (5)

A further question is how much additional computation is required by our improved get operation. We measure this by considering the number of cells that must be examined, as the need to pull in randomly distributed cells from memory will generally be the practical bottleneck. In the worst case, this could require examining k^2 total cells. However, when looking up the key x in the IBLT, we only check k − 1 additional cells for each x′ that appears in the same cell as x with a count of 2 for the cell. The expected number of additional cells to be examined is just:

Σ_{i=1}^{k} i (k − 1) C(k, i) p_1^i (1 − p_0 − p_1)^{k−i}.   (6)

This is generally quite small. Moreover, we can do better by considering each x′ in the same cell as x with a count of 2 sequentially; this allows us to examine fewer cells when we can find the value for x′ early in the sequential check. (The closed form for the approximation of that value is left as a simple exercise.)

We can apply this idea recursively to further reduce the failure probability at the expense of adding complexity to the improved get procedure. That is, we can recursively use the improved get instead of the traditional get in step 5 of the procedure previously described, for as many levels as desired. While in practice the gains from recursing are limited, as we show in Section 4, it is worth considering analytically. A formula to estimate the failure probability of a recursive get can be obtained as follows:

q_0 = (1 − p_0);
q_{i+1} = (1 − p_0 − p_1) + p_1 (q_i)^{k−1};
P_{fail,i} = [(1 − p_0 − p_1) + p_1 (q_i)^{k−1}]^k.   (7)

In this case, q_i gives the probability that a position with a count value of two fails to recover the key when it is used recursively i times. The global probability of failure after i levels of recursion is given by P_{fail,i}. Alternatively, one can find the fixed point

q = (1 − p_0 − p_1) + p_1 q^{k−1}   (8)

to find the limit of the capabilities of the recursive get. Combining the above, we can state the following theorem for our modified get operations (which, for clarity, we distinctly refer to as the improved get and the recursive version).

Theorem 1. Consider IBLTs in the asymptotic regime where n is growing and the ratio m/n is fixed at some constant value c. Recall k is the number of hash functions. For any constant k, the probability that the improved get operation fails to return the value of a key in the IBLT is

[1 − e^{−nk/m} − (nk/m) e^{−nk/m} + (nk/m) e^{−nk/m} (1 − e^{−nk/m})^{k−1}]^k + o(1).

For any constant k and for any constant ε > 0, the probability that the recursive get operation fails to return the value of a key in the IBLT can be made to be at most z + ε + o(1), where

z = [1 − e^{−nk/m} − (nk/m) e^{−nk/m} + (nk/m) e^{−nk/m} q^{k−1}]^k,

and q is the fixed point of

q = 1 − e^{−nk/m} − (nk/m) e^{−nk/m} + (nk/m) e^{−nk/m} q^{k−1},

by using a large enough constant number of recursive rounds.

3.2. Analysis for keys not stored

We have not yet considered what happens if one of our modified get operations is performed with a key not stored in the IBLT. We first note that the use of the improved get will not give an incorrect value, even if an item not inside the IBLT is queried. For suppose we query for an x not in the IBLT, and the position corresponding to h_i(x) has a count of 2. We obtain a corresponding key x′ = keySum − x, but note that x′ is not present in the IBLT, because if x′ were in the IBLT, then x = keySum − x′ would be as well, contradicting our assumption regarding x. Hence x′ will not be verified when checking its k − 1 other hash locations, and we cannot return an incorrect value for x based on its ith hash value. Therefore, our improved get never returns an incorrect value. (The same holds for the recursive version.)

The probability of failure (that is, returning "not found" instead of null) for a key x not in the IBLT under the original get operation is simply (1 − p_0 − p_1)^k; that is, all cells must have at least 2 actual keys held in the IBLT. Under the new scheme, not only must all cells have at least 2 keys, but for each cell where there are exactly 2 keys, when we consider the value x′ = keySum − x for that cell, several things must happen. First, we check that x′ hashes to the cell with x. This happens with probability k/m. (Note that we are assuming an oblivious setting; that is, an adversary is not choosing x based on the state of the table, but the choice of x is assumed independent of the existing hashed values. Otherwise, an adversary could try to find a pair x, x′ so that their sum matches the keySum at that cell and both x and x′ hash to that cell.) Further, each of the k − 1 other cells that x′ hashes to must have at least 2 keys from the IBLT hash there (or else we will determine that x′ is not in the IBLT, and hence x is not in the IBLT).

Following the same analysis as used to find P_fail when the key was in the IBLT, here the probability of failure is:

P_fail ≈ [(1 − p_0 − p_1 − p_2) + (k p_2/m) (1 − p_0 − p_1)^{k−1}]^k,   (9)
where p_2 is well approximated by ((nk)^2/(2m^2)) e^{−nk/m}. Similarly, we can derive the expected number of additional cells examined and the performance of the recursive get for keys not in the IBLT using the same type of analysis as previously.

4. Evaluation

To test the effectiveness of the proposed modifications to the get procedure, we performed simulations to provide a case study. The parameters are as follows: m = 2^16, k ranges from 2 to 7, and the load factor n/m varies from 5% to 50%. In each configuration, a get for every entry in the IBLT was performed. The simulation was repeated using different IBLTs until 10^10 get operations had been tested for each set of parameters. The results are shown in the set of plots presented in Fig. 1. Each plot includes the expected theoretical values and the simulated ones. For the simulation, random entries and standard hash functions are used.

It can be observed that the improved get does reduce the probability of failure significantly. In particular, for low loads n/m the failure probability is reduced by several orders of magnitude. The reduction also increases when k increases. The recursive get provides some additional but minor gains. The benefit is larger for smaller values of k and for higher loads. Finally, in all cases, the simulation results match well the theoretical estimates given by Eqs. (1), (5) and (7).

Table 1 shows the average increase in the number of cells examined by using the improved get operation with respect to the traditional get. We first provide the theoretical estimate (Th) from Eq. (6), which we recall is based on examining all cells for each x′ that appears in the same cell as the search key x with a count of 2. We also provide simulated results (Sim) for the case of lookups of keys in the IBLT, where each possible x′ is considered sequentially, so the get operation can end as soon as a key x′ appears alone in a cell.
Theoretical and simulated results for the number of cells examined with the original get operation are also shown. The results are for the simulations reported in Fig. 1, where in addition to the probability of failure, the number of cells examined was also logged. While the number of additional cells that have to be examined on average is small in either case, it is notably smaller when the x′ are considered sequentially.

Table 2 shows experimental results for the failure probability for keys not in the IBLT. In this case we fill an IBLT with random keys. Afterwards, 10^4 keys were generated randomly, checking that they were not in the IBLT. This experiment was repeated 10^6 times, thus obtaining a total of 10^10 get operations for each value of the load factor and k. We report the fraction of get operations returning a failure ("not found" instead of null) for
a key x not in the IBLT for the original and the improved get operations. Experimental results were in all cases in good agreement with the estimated values computed using Eq. (9), which are also shown.

Fig. 1. Probability of failure for the different get procedures for m = 2^16 and values of k from 2 to 7.
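The theoretical entries of Table 2 can be reproduced directly from Eq. (9) and from (1 − p_0 − p_1)^k under the Poisson approximation. The following is an illustrative cross-check, not the authors' evaluation code; the function name and parameterization by the load factor are our own.

```python
import math

def not_stored_failure(load, k, m=2**16):
    """Theoretical failure probabilities for a key NOT in the IBLT.

    Returns (original, improved): the original get fails with (1 - p0 - p1)^k,
    the improved get with Eq. (9), using the Poisson approximations for
    p0, p1 and p2.
    """
    lam = load * k                      # nk/m for load factor n/m
    p0 = math.exp(-lam)
    p1 = lam * p0
    p2 = (lam**2 / 2) * p0
    original = (1 - p0 - p1)**k
    improved = ((1 - p0 - p1 - p2) + (k * p2 / m) * (1 - p0 - p1)**(k - 1))**k
    return original, improved

# e.g. not_stored_failure(0.10, 3) gives about (5.04e-5, 4.66e-8), the
# "original (Th)" and "improved (Th)" entries for 10% load and k = 3 in Table 2.
```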
Table 1. Average number of additional cells accessed by the improved get operation, along with the original number of cells accessed.

Load  Type            k=2     k=3     k=4     k=5     k=6     k=7
10%   increase (Th)   0.0594  0.0896  0.1153  0.1454  0.1847  0.2376
      increase (Sim)  0.0365  0.0221  0.0171  0.0151  0.0154  0.0170
      original (Th)   1.1813  1.3264  1.4742  1.6332  1.8067  1.9973
      original (Sim)  1.1822  1.3274  1.4742  1.6346  1.8052  1.9983
20%   increase (Th)   0.1768  0.4022  0.7203  1.1747  1.8069  2.6517
      increase (Sim)  0.1276  0.1560  0.1969  0.2687  0.3679  0.5359
      original (Th)   1.3297  1.6548  2.0209  2.4439  2.9335  3.4965
      original (Sim)  1.3310  1.6550  2.0212  2.4459  2.9357  3.4977
30%   increase (Th)   0.2971  0.7732  1.4801  2.4382  3.6168  4.9324
      increase (Sim)  0.2371  0.4015  0.6424  1.0134  1.5613  2.3031
      original (Th)   1.4512  1.9456  2.5284  3.2135  4.0036  4.8935
      original (Sim)  1.4525  1.9468  2.5269  3.2160  4.0033  4.8914
40%   increase (Th)   0.3959  1.0590  1.9706  3.0260  4.0599  4.9080
      increase (Sim)  0.3369  0.6823  1.1822  1.8677  2.7046  3.5630
      original (Th)   1.5507  2.1871  2.9434  3.8177  4.7930  5.8448
      original (Sim)  1.5511  2.1886  2.9462  3.8216  4.7939  5.8408
50%   increase (Th)   0.4651  1.2120  2.0997  2.9137  3.4711  3.6931
      increase (Sim)  0.4152  0.9058  1.5477  2.2542  2.8816  3.2348
      original (Th)   1.6321  2.3804  3.2588  4.2438  5.3010  6.3968
      original (Sim)  1.6323  2.3810  3.2608  4.2434  5.3006  6.3945
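The "increase (Th)" rows of Table 1 come from Eq. (6). A quick cross-check under the Poisson approximation (illustrative sketch; the function name and load-factor parameterization are ours):

```python
import math

def expected_extra_cells(load, k):
    """Expected additional cells examined by the improved get, Eq. (6).

    Uses the Poisson approximations p0 = e^{-nk/m}, p1 = (nk/m) e^{-nk/m}.
    """
    lam = load * k  # nk/m for load factor n/m
    p0 = math.exp(-lam)
    p1 = lam * p0
    return sum(i * (k - 1) * math.comb(k, i) * p1**i * (1 - p0 - p1)**(k - i)
               for i in range(1, k + 1))

# e.g. expected_extra_cells(0.10, 2) is about 0.0594, matching the first
# "increase (Th)" entry of Table 1.
```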
Table 2. Failure probability for the original and improved get operations.

Load  Type            k=2      k=3      k=4      k=5       k=6       k=7
10%   original (Sim)  3.09E-4  5.01E-5  1.44E-5  6.04E-6   3.32E-6   2.20E-6
      original (Th)   3.07E-4  5.04E-5  1.44E-5  5.97E-6   3.28E-6   2.23E-6
      improved (Sim)  1.38E-6  4.49E-8  3.97E-9  0         3.05E-10  0
      improved (Th)   1.32E-6  4.66E-8  3.95E-9  6.17E-10  1.53E-10  5.41E-11
20%   original (Sim)  3.80E-3  1.81E-3  1.34E-3  1.29E-3   1.48E-3   1.89E-3
      original (Th)   3.79E-3  1.81E-3  1.34E-3  1.29E-3   1.47E-3   1.89E-3
      improved (Sim)  6.33E-5  1.23E-5  5.08E-6  3.38E-6   3.07E-6   3.53E-6
      improved (Th)   6.28E-5  1.24E-5  5.06E-6  3.34E-6   3.06E-6   3.55E-6
30%   original (Sim)  1.49E-2  1.18E-2  1.30E-2  1.69E-2   2.41E-2   3.54E-2
      original (Th)   1.49E-2  1.18E-2  1.30E-2  1.69E-2   2.40E-2   3.54E-2
      improved (Sim)  5.39E-4  2.47E-4  2.11E-4  2.57E-4   3.82E-4   6.48E-4
      improved (Th)   5.34E-4  2.48E-4  2.11E-4  2.55E-4   3.82E-4   6.48E-4
40%   original (Sim)  3.66E-2  3.84E-2  5.10E-2  7.38E-2   1.10E-1   1.59E-1
      original (Th)   3.66E-2  3.84E-2  5.09E-2  7.39E-2   1.09E-1   1.59E-1
      improved (Sim)  2.26E-3  1.74E-3  2.21E-3  3.55E-3   6.36E-3   1.18E-2
      improved (Th)   2.25E-3  1.75E-3  2.20E-3  3.53E-3   6.35E-3   1.18E-2
50%   original (Sim)  6.98E-2  8.66E-2  1.25E-1  1.83E-1   2.64E-1   3.60E-1
      original (Th)   6.98E-2  8.65E-2  1.24E-1  1.84E-1   2.64E-1   3.60E-1
      improved (Sim)  6.49E-3  6.95E-3  1.10E-2  1.98E-2   3.69E-2   6.66E-2
      improved (Th)   6.45E-3  6.98E-3  1.09E-2  1.98E-2   3.68E-2   6.67E-2
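For concreteness, the improved and recursive get procedures evaluated above can be sketched as follows. This is an illustrative reimplementation, not the simulation code used for the tables: XOR plays the role of "sum", and the SHA-256-derived hash functions and integer keys are our own assumptions.

```python
import hashlib

NOT_FOUND = object()  # sentinel for a failed ("not found") lookup

class IBLT:
    """Minimal IBLT (Section 2) plus the improved/recursive get (Section 3)."""
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.count = [0] * m
        self.key_sum = [0] * m
        self.value_sum = [0] * m

    def _positions(self, x):
        # One position in each of the k subtables of m/k cells.
        c = self.m // self.k
        return [i * c + int.from_bytes(
            hashlib.sha256(f"{i}:{x}".encode()).digest()[:8], "big") % c
            for i in range(self.k)]

    def insert(self, x, y):
        for p in self._positions(x):
            self.count[p] += 1
            self.key_sum[p] ^= x
            self.value_sum[p] ^= y

    def get(self, x):
        """Standard get: None is null, NOT_FOUND is a failure."""
        pos = self._positions(x)
        if any(self.count[p] == 0 for p in pos):
            return None
        for p in pos:
            if self.count[p] == 1:
                return self.value_sum[p] if self.key_sum[p] == x else None
        return NOT_FOUND

    def improved_get(self, x, depth=1):
        r = self.get(x)                      # step 1: traditional get
        if r is not NOT_FOUND:
            return r
        for i, p in enumerate(self._positions(x)):
            if self.count[p] != 2:           # steps 2-3: count-two cells
                continue
            x2 = self.key_sum[p] ^ x         # the other key x' = keySum - x
            if self._positions(x2)[i] != p:  # step 4: x' must hash here too
                return None
            # Step 5: look up x', recursively if depth allows.
            y2 = self.improved_get(x2, depth - 1) if depth > 1 else self.get(x2)
            if y2 is None:
                return None
            if y2 is not NOT_FOUND:
                return self.value_sum[p] ^ y2  # step 6: y = valueSum - y'
        return NOT_FOUND
```

Calling improved_get with depth = 1 gives the non-recursive improved get; depth > 1 gives the recursive version, which replaces the traditional get in step 5 as described in Section 3.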
5. Conclusions

Invertible Bloom Lookup Tables (IBLTs) have recently been proposed for several applications. In this paper, a technique to improve the performance of IBLTs has been presented. The proposed scheme reduces the probability that a key is not found when it is actually stored in the IBLT, and similarly reduces the probability that a key not in the IBLT is returned with an ambiguous response. Our results show its effectiveness in reducing the probability of failure. Our non-recursive improved get operation can increase the number of cells that need to be examined to as many as k^2, instead of k for the original get operation; however, we have shown that in expectation it will be much less. The significance of this cost depends on the application, but we expect for many practical applications it will prove a desirable option.

References

[1] B. Bloom, Space/time tradeoffs in hash coding with allowable errors, Commun. ACM 13 (7) (1970) 422–426.
[2] A. Broder, M. Mitzenmacher, Network applications of Bloom filters: A survey, Internet Math. (2004) 495–509.
[3] D. Eppstein, M. Goodrich, Straggler identification in round-trip data streams via Newton's identities and invertible Bloom filters, IEEE Trans. Knowl. Data Eng. 23 (2) (2011) 297–306.
[4] D. Eppstein, M. Goodrich, F. Uyeda, G. Varghese, What's the difference? Efficient set reconciliation without prior context, ACM SIGCOMM Comput. Commun. Rev. 41 (4) (2011) 218–229.
[5] M.T. Goodrich, M. Mitzenmacher, Invertible Bloom lookup tables, in: Proc. of the 49th Allerton Conference on Communication, Control and Computing, 2011, pp. 792–799.
[6] M. Mitzenmacher, G. Varghese, Biff (Bloom filter) codes: Fast error correction for large data sets, in: Proc. of the IEEE International Symposium on Information Theory, 2012, pp. 483–487.