Optimized Bloom Filters: Efficiency in Set Membership Approximation

Bloom filters Probability and Computing Randomized algorithms and probabilistic analysis P109~P111 Michael Mitzenmacher Eli Upfal

Introduction • Approximate set membership problem . • Trade-off between the space and the false positive probability . • Generalize the hashing ideas.

Approximate set membership problem • Suppose we have a set S = {s1,s2,...,sm}  universe U • Represent S in such a way we can quickly answer “Is x an element of S ?” • To take as little space as possible ,we allow false positive (i.e. xS , but we answer yes ) • If xS , we must answer yes .

Bloom filters • Consist of an arrays A[n] of n bits (space) , and k independent random hash functions h1,…,hk : U --> {0,1,..,n-1} 1. Initially set the array to 0 2.  sS, A[hi(s)] = 1 for 1 i  k (an entry can be set to 1 multiple times, only the first times has an effect ) 3. To check if xS , we check whether all location A[hi(x)] for 1 i  k are set to 1 If not, clearly xS. If all A[hi(x)] are set to 1 ,we assume xS

y y 1 1 1 1 1 1 1 1 1 1 To check if y is in S, check the k hash location. If a 0 appears , y is not in S If only 1s appear, conclude that y is in S This may yield false positive x1 x2 1 1 1 1 1 Each element of S is hashed k times Each hash location set to 1 0 0 0 0 0 0 0 0 0 0 0 0 Initial with all 0

The probability of a false positive • We assume the hash function are random. • After all the elements of S are hashed into the bloom filters ,the probability that a specific bit is still 0 is

To simplify the analysis ,we can assume a fraction p of the entries are still 0 after all the elements of S are hashed into bloom filters. • In fact,let X be the random variable of number of those 0 positions. By Chernoff bound It implies X/n will be very close to p with a very high probability

The probability of a false positive f is • To find the optimal k to minimize f . Minimize f iff minimize g=ln(f) • k=ln(2)*(n/m) • f = (1/2)k = (0.6185..)n/m The false positive probability falls exponentially in n/m ,the number bits used per item !!

Conclusion • A Bloom filters is like a hash table ,and simply uses one bit to keep track whether an item hashed to the location. • If k=1 , it’s equivalent to a hashing based fingerprint system. • If n=cm for small constant c,such as c=8 ,then k=5 or 6 ,the false positive probability is just over 2% . • It’s interesting that when k is optimal k=ln(2)*(n/m) , then p= 1/2. An optimized Bloom filters looks like a random bit-string

Optimized Bloom Filters: Efficiency in Set Membership Approximation

Optimized Bloom Filters: Efficiency in Set Membership Approximation

Presentation Transcript

Codes, Bloom Filters, and Overlay Networks

Bloom Filters: A History and Modern Applications

Bloom Based Filters for Hiera r chical Data

Bloom Filters

Bloom filters

Network Applications of Bloom Filters: A Survey

Bloom Filters

Bloom Filters

Payload Attribution via Hierarchical Bloom Filters

Bloom Filters

Measurement Algorithms: Bloom Filters and Beyond

Fast Packet Classification Using Bloom filters

Improving State Coverage Using Bloom Filters

Service Identifiers and Bloom Filters

Bloom Filters

Bloom Filters

Deep Packet Inspection Using Parallel Bloom Filters

Bloom Filters

Optimizing Data Popularity Conscious Bloom Filters

Bloom filters

Beyond Bloom Filters: Approximate Concurrent State Machines

Bloom Filters