How to Approximate a Set Without Knowing Its Size in Advance?

Rasmus Pagh, Gil Segev, Udi Wieder
IT University of Copenhagen, Microsoft Research, Stanford

Set Membership
Universe U, set S ⊆ U.
Build a data structure such that, given a query x ∈ U:
If x ∈ S, output ‘yes’.
If x ∉ S, output ‘no’.

Approximate Set Membership
Universe U, set S ⊆ U with |S| = n.
Build a data structure such that, given a query x ∈ U:
If x ∈ S, output ‘yes’.
If x ∉ S, output ‘no’ with probability at least 1 − ε, where the probability is taken over the internal randomness of the data structure.
Allowing this error leads to savings in the size of the data structure.

ASM in a picture
[Figure: a set S ⊆ [100]×[100] with |S| = 188, shown next to four approximations A(S) of sizes |A(S)| = 5213, 2699, 1580 and 918.]

Applications
Many: very common in practice, in databases, networking and more.
Serves as a filter for accessing slow or bandwidth-bounded data.
Example: requests arrive first at the filter, which determines which requests reside in the proxy’s cache and which should be fetched from the network.
The cost of a false positive is a cache miss.
[Diagram labels: Request; Filter (approximation of the cache); Web Proxy Cache; External Web]

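Lower Bounds for Static Case: [CFGMW78]
Idea: use the data structure to encode the set S.
There are C(u, n) such sets, where u = |U|, so the length of the encoding is at least log C(u, n).
Create a data structure of m bits.
The data structure specifies a set of size about n + εu (the queries it answers ‘yes’).
Encode the set S relative to the data structure, as an n-element subset of that ‘yes’ set: log C(n + εu, n) bits.
Hence m ≥ log C(u, n) − log C(n + εu, n), which is roughly n log(1/ε).
The low-order term could be improved to […].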
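
The counting argument above can be checked numerically. The following is a minimal sketch, assuming the ‘yes’ set has size at most n + εu; the function names and the sample parameters are illustrative and not taken from the slides.

```python
import math

def log2_binom(a: int, b: int) -> float:
    """log2 of the binomial coefficient C(a, b), computed via log-gamma."""
    nat = math.lgamma(a + 1) - math.lgamma(b + 1) - math.lgamma(a - b + 1)
    return nat / math.log(2)

def static_lower_bound_bits(u: int, n: int, eps: float) -> float:
    """Bits forced by the encoding argument: log C(u, n) - log C(n + eps*u, n).

    Assumption: the data structure answers 'yes' on at most n + eps*u elements.
    """
    yes_set_size = n + int(eps * u)
    return log2_binom(u, n) - log2_binom(yes_set_size, n)

if __name__ == "__main__":
    u, n, eps = 10**9, 10**4, 0.01
    # Compare the counting bound with n * log2(1/eps).
    print(static_lower_bound_bits(u, n, eps), n * math.log2(1 / eps))
```

For a universe much larger than n/ε the two printed numbers should be close, which is the sense in which the bound is about n log(1/ε) bits.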

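Upper Bounds – Bloom Filters
Bloom Filters [Bloom 1970]
The data structure is a bit array of size m.
Sample k hash functions h_1, …, h_k.
For an item x, set bits h_1(x), …, h_k(x) to ‘1’.
Upon a query y, answer ‘yes’ iff bit h_i(y) is 1 for all i.
Minimum space is achieved when k = log(1/ε), giving m ≈ 1.44 · n log(1/ε) bits.
Suboptimal space.
Supports online insertions.
No support for deletions.
[Figure: a bit array in which the bits selected for an item x are set to 1.]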
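
A minimal Bloom filter sketch in Python, to make the insert and query rules concrete. The class name, the use of salted SHA-256 in place of k independent hash functions, and the sizing m = k·n/ln 2 are assumptions made for illustration.

```python
import hashlib
import math

class BloomFilter:
    """Bit array of m bits with k hash functions (a sketch, not tuned for speed)."""

    def __init__(self, n: int, eps: float):
        # Standard sizing: k = log2(1/eps) hash functions, m = k * n / ln 2 bits.
        self.k = max(1, round(math.log2(1 / eps)))
        self.m = max(1, math.ceil(self.k * n / math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item: str):
        # Assumption: SHA-256 salted with the index i stands in for h_i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # Set the k bits selected by the item to 1.
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        # Answer 'yes' iff all k selected bits are 1.
        return all((self.bits[p // 8] >> (p % 8)) & 1 for p in self._positions(item))
```

Usage: `bf = BloomFilter(n=1000, eps=0.01)`, then `bf.add("a")` and `"a" in bf`. Note that n must be supplied up front, which is exactly the limitation the talk addresses.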

Dictionary Based Upper Bounds
Idea of [CFGMW78]: store the multi-set h(S) = {h(x) : x ∈ S} in a dictionary, where h is a universal hash function.
If deletions are not needed, store distinct values only.
If h(S) is stored succinctly: […]
Efficient succinct dictionaries supporting insertions and deletions are given in [ANS10].
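
A sketch of the dictionary-based idea, assuming fingerprints of about log(n/ε) bits so that a fixed query collides with each stored fingerprint with probability at most ε/n. The class name is hypothetical, SHA-256 stands in for the universal hash function, and a Python `set` stands in for a succinct dictionary (so this shows the logic, not the space bound).

```python
import hashlib
import math

class FingerprintFilter:
    """Stores h(x) for each inserted x; distinct values only, so no deletions."""

    def __init__(self, n: int, eps: float):
        # Assumed fingerprint length: ceil(log2(n / eps)) bits (must be <= 64 here).
        self.bits = math.ceil(math.log2(n / eps))
        self.table = set()  # stand-in for a succinct dictionary

    def _h(self, item: str) -> int:
        # Keep the top `bits` bits of a 64-bit hash prefix.
        digest = hashlib.sha256(item.encode()).digest()
        return int.from_bytes(digest[:8], "big") >> (64 - self.bits)

    def add(self, item: str) -> None:
        self.table.add(self._h(item))

    def __contains__(self, item: str) -> bool:
        return self._h(item) in self.table
```

As with the Bloom filter, the fingerprint length depends on n, so the size has to be known when the structure is created.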

Separation of Static and Dynamic
If the entire set is given in advance, [DP08] show […]
In fact, they solve a more general problem called ‘the Retrieval Problem’.
If the items arrive one by one, [LP10] show […]

But in practice…
The size of the set is not known in advance!
This leads to over-provisioning of space up front.
Space is wasted as long as the set is small.
Typically the data structure lies in prime real estate; the whole idea is saving space.
The problem has been raised and handled in ‘practical’ papers, typically in a way that is naïve from a ‘theoretical’ point of view.

Main Results (approximate)
Say we have an upper bound on the size of the set and we are willing to allocate […] bits up front.
Items arrive one by one.
Lower Bound: there will be an n such that after inserting n items the space used is at least (1 − o(1)) · n log(1/ε) + Ω(n log log n) bits.
Upper Bound: the space above can be obtained with:
Constant-time membership queries.
Amortized constant-time insertions. These could be de-amortized with a small multiplicative increase in space.
Super-linear bound!

Lower Bound
Sanity check: use a dictionary to store […], where […].
When a new item x arrives, build a new dictionary for […].
Indeed, this works when we are willing to pay […] bits up front, even for the first […] items.
This is the case when […] (standard dictionary).
If […] is the set of ‘yes’ instances, then […] is not much bigger than […].
We show that in our case […] is much bigger than […].

Lower Bound – proof sketch
Partition the inserted items into epochs of […].
[…] is a prefix of epochs.
The total error is […], so […], so […] such that ...

Lower Bound: the encoding
Idea: use the data structure to encode […], a set of size […].
Write down […] explicitly, followed by the data structure, using […] bits.
The encoding so far characterizes […] and […].
A set of size […] is encoded using […].
This implies a space of […] bits per item.

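Upper Bound – Construction 1
Almeida et al., 2007: keep log n data structures. The i-th has 2^i elements and an FPR ε_i such that Σ ε_i ≤ ε.
The approximation is the union of these sets.
They suggest choosing ε_i decreasing geometrically in i.
Space is asymptotically bad: the last structure has on the order of n log n bits.
Query time is logarithmic.
Our suggestion: choose ε_i = Θ(ε / i²).
Space is now O(n log(1/ε) + n log log n) bits.
Query time is still logarithmic.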
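
A sketch of a growing sequence of sub-filters, under stated assumptions: level i holds up to 2^i items, its false positive rate is ε_i = 6ε/(π² i²) so that the ε_i sum to at most ε, and each level is a set of fingerprints rather than a Bloom filter. All names here are hypothetical; the point is that the size of the set is never supplied, and that a query probes every level.

```python
import hashlib
import math

def _fingerprint(item: str, salt: int, bits: int) -> int:
    """Top `bits` bits of a salted 128-bit hash prefix (stand-in for a universal hash)."""
    digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
    return int.from_bytes(digest[:16], "big") >> (128 - bits)

class GrowingFilter:
    """Sequence of sub-filters; level i (1-based) holds up to 2**i items."""

    def __init__(self, eps: float):
        self.eps = eps
        self.levels = []          # list of (fingerprint_bits, set_of_fingerprints)
        self.count_in_last = 0    # items inserted into the newest level

    def _eps_i(self, i: int) -> float:
        # Assumed choice: eps_i = 6*eps / (pi^2 * i^2); sum over i >= 1 is eps.
        return 6 * self.eps / (math.pi ** 2 * i ** 2)

    def add(self, item: str) -> None:
        i = len(self.levels)
        if i == 0 or self.count_in_last >= 2 ** i:
            # Newest level is full: open level i + 1 with longer fingerprints.
            i += 1
            bits = math.ceil(math.log2(2 ** i / self._eps_i(i)))
            self.levels.append((bits, set()))
            self.count_in_last = 0
        bits, table = self.levels[-1]
        table.add(_fingerprint(item, i, bits))
        self.count_in_last += 1

    def __contains__(self, item: str) -> bool:
        # Union of all levels: logarithmically many probes per query.
        return any(_fingerprint(item, i, bits) in table
                   for i, (bits, table) in enumerate(self.levels, start=1))
```

Each level i spends about log(1/ε) + 2 log i + O(1) bits per fingerprint, which is where a log log n term per item comes from in this sketch.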

Getting Constant Query Time
As before, partition insertions into epochs of length […].
Recall the dictionary-based solution: store a fingerprint h(x), where h is a universal hash function, in a dictionary.
Idea: store the fingerprint together with […] extra bits as ‘data’.
[Diagram labels: fingerprint; bits]

Getting Constant Query Time
Within an epoch, the dictionary is essentially the standard solution, with […] extra bits per element.
When an epoch ends:
Move one bit of ‘data’ to the fingerprint.
Rebuild the dictionary with the new fingerprints (one bit longer per fingerprint).
What if all the ‘data’ has already been used? Insert both h(x).0 and h(x).1 into the dictionary, with empty data.
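
The rebuild rule above can be sketched as follows. This is one reading of the slides, not the authors' implementation: the ‘data’ is taken to be the next few bits of the same hash, an epoch is assumed to end each time the number of items doubles, and `EpochFilter`, `base_bits` and `data_bits` are hypothetical names and parameters. A Python `dict` stands in for the dictionary, so only the logic is illustrated, not the space bound.

```python
import hashlib

def _hash_bits(item: str, length: int) -> str:
    """First `length` bits of SHA-256(item) as a '0'/'1' string (length <= 256)."""
    digest = hashlib.sha256(item.encode()).digest()
    return bin(int.from_bytes(digest, "big"))[2:].zfill(256)[:length]

class EpochFilter:
    def __init__(self, base_bits: int = 8, data_bits: int = 3):
        self.fp_len = base_bits      # current fingerprint length
        self.data_bits = data_bits   # extra 'data' bits stored with each new item
        self.table = {}              # fingerprint -> set of remaining data strings
        self.count = 0
        self.epoch_end = 1           # assumed schedule: epochs end at sizes 1, 2, 4, ...

    def add(self, item: str) -> None:
        if self.count == self.epoch_end:
            self._rebuild()
        bits = _hash_bits(item, self.fp_len + self.data_bits)
        self.table.setdefault(bits[:self.fp_len], set()).add(bits[self.fp_len:])
        self.count += 1

    def _rebuild(self) -> None:
        new_table = {}
        for fp, datas in self.table.items():
            for data in datas:
                if data:
                    # Move one bit of 'data' into the fingerprint.
                    new_table.setdefault(fp + data[0], set()).add(data[1:])
                else:
                    # Data used up: insert both extensions with empty data.
                    new_table.setdefault(fp + "0", set()).add("")
                    new_table.setdefault(fp + "1", set()).add("")
        self.table = new_table
        self.fp_len += 1
        self.epoch_end *= 2

    def __contains__(self, item: str) -> bool:
        # A single dictionary probe, regardless of how many epochs have passed.
        return _hash_bits(item, self.fp_len) in self.table
```

Every inserted item always has an entry whose fingerprint is a prefix of its hash, so there are no false negatives; items that outlive their data bits cost extra entries, which is what the analysis has to account for.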

Analysis
An item inserted in epoch j will create […] insertions.
There are […] items in epoch j, so the total number of insertions is […].
The length of a fingerprint is […], so the space of the dictionary for fingerprints is […].
Extra […] bits for the ‘data’ part.

Extensions and standard tricks
Extra space is required when rebuilding the new dictionary: both dictionaries need to be stored until the rebuild is complete.
This can be mitigated by bucketing items into many smaller dictionaries and rebuilding the smaller dictionaries one at a time.
De-amortization of insert: each time an item is inserted, perform O(1) operations on the next dictionary.
This is not compatible with the bucketing technique and requires a small increase in space.

Supporting Deletions
Necessary assumption: only items that are in the set are ever deleted.
The removal of a ‘false positive’ item may introduce false negatives.
The assumption makes sense in many applications, for example when the data structure filters a cache.
The standard approach of storing multi-sets is problematic: an item generates many signatures, and we can’t tell which one to remove.
Upon insertion, if the fingerprint already appears, put it in a secondary structure.
Upon removal, check the secondary structure first.
Requires the assumption that each item is inserted only once.
Requires some extra bookkeeping.

Open Problems
Bridge a theory–practice gap.
Practitioners seem content with the solution of multiple Bloom filters.
But then, practitioners seem content with Bloom filters…
Get the leading constant in front of log log n.

THANK YOU
