
Why Simple Hash Functions Work: Exploiting the Entropy in a Data Stream

Presentation Transcript


  1. Why Simple Hash Functions Work: Exploiting the Entropy in a Data Stream. Michael Mitzenmacher, Salil Vadhan. And improvements with Kai-Min Chung.

  2. The Question • Traditional analyses of hashing-based algorithms & data structures assume a truly random hash function. • In practice: simple (e.g. universal) hash functions perform just as well. • Why?

  3. Outline • Three hashing applications • The new model and results • Proof ideas

  4. Bloom Filters To approximately store S = {x1,…,xT} ⊆ [N]: • Start with an array of M = O(T) zeroes. • Hash each item k = O(1) times to [M] using h : [N] → [M]^k, put a one in each location. To test y ∈ S: • Hash & accept if there are ones in all k locations.
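
A minimal sketch of this construction in Python (illustrative only: the salted SHA-256 locations stand in for the k truly random hash values the analysis assumes, and the class and parameter names are made up here):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: an M-bit array with k hash locations per item."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _locations(self, item):
        # Derive k locations in [M] by hashing the item under k different salts.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for loc in self._locations(item):
            self.bits[loc] = 1

    def might_contain(self, item):
        # Accept only if all k locations are set; false positives are possible.
        return all(self.bits[loc] for loc in self._locations(item))
```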

  5. Bloom Filter Analysis Thm [B70]: ∀S, ∀y ∉ S, if h is a truly random hash function, Pr_h[accept y] = 2^(−(ln 2)·M/T + o(1)) for an optimal choice of k.
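
A quick numeric sketch of this bound (the choice of M/T = 8 array bits per stored item is illustrative, not from the slides):

```python
import math

bits_per_item = 8                            # M/T, illustrative
k_opt = bits_per_item * math.log(2)          # optimal number of hashes: (M/T)·ln 2 ≈ 5.5
false_pos = 2 ** (-math.log(2) * bits_per_item)
print(round(k_opt, 2), round(false_pos, 4))  # ≈ 5.55 hashes, false positive prob ≈ 0.021
```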

  6. Balanced Allocations • Hashing T items into T buckets. • What is the maximum number of items, or load, of any bucket? • Assume buckets are chosen independently & uniformly at random. • Well-known result: Θ(log T / log log T) maximum load w.h.p.

  7. Power of Two Choices • Suppose each ball can pick two bins independently and uniformly and choose the bin with less load. • Thm [ABKU94]: maximum load log log T / log 2 + Θ(1) w.h.p.
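
A small simulation sketch contrasting the one-choice setting of slide 6 with the two-choice rule above (Python; the uniform random picks stand in for truly random hash values, and T is an illustrative size):

```python
import random

def max_load(T, choices):
    """Throw T balls into T bins; each ball picks `choices` bins uniformly
    at random and goes to the least loaded one. Return the maximum load."""
    bins = [0] * T
    for _ in range(T):
        picks = [random.randrange(T) for _ in range(choices)]
        best = min(picks, key=lambda b: bins[b])
        bins[best] += 1
    return max(bins)

T = 1 << 16                              # illustrative number of balls/bins
print("one choice :", max_load(T, 1))    # grows like log T / log log T
print("two choices:", max_load(T, 2))    # grows like log log T / log 2 + O(1)
```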

  8. Linear Probing • Hash elements into an array of length M. • If h(x) is already full, try h(x)+1, h(x)+2, … until an empty spot is found; place x there. • Thm [K63]: Expected insertion time for the T'th item is 1/(1 − T/M)^2 + o(1).
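
A minimal linear probing sketch in Python (Python's built-in hash stands in for h, the table size is illustrative, and the sketch assumes the table never fills up completely):

```python
class LinearProbingTable:
    """Open addressing with linear probing in an array of length M."""

    def __init__(self, m):
        self.m = m
        self.slots = [None] * m

    def insert(self, x):
        # Try h(x), h(x)+1, h(x)+2, ... (mod M) until an empty slot is found.
        i = hash(x) % self.m
        while self.slots[i] is not None:
            if self.slots[i] == x:        # already present
                return
            i = (i + 1) % self.m          # assumes at least one empty slot remains
        self.slots[i] = x

    def contains(self, x):
        i = hash(x) % self.m
        while self.slots[i] is not None:
            if self.slots[i] == x:
                return True
            i = (i + 1) % self.m
        return False
```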

  9. Explicit Hash Functions Can sometimes analyze for explicit (e.g. universal [CW79]) hash functions, but • performance somewhat worse, and/or • hash functions complex/inefficient. Noted since the 1970s that simple hash functions match idealized performance on real data.
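
For concreteness, a sketch of the classic Carter–Wegman universal family h_{a,b}(x) = ((a·x + b) mod p) mod M; the particular prime and the helper name are illustrative choices, not from the slides:

```python
import random

P = (1 << 61) - 1   # a prime larger than the universe size N (illustrative choice)

def random_universal_hash(m):
    """Sample h(x) = ((a*x + b) mod p) mod m from the Carter-Wegman universal family."""
    a = random.randrange(1, P)
    b = random.randrange(P)
    return lambda x: ((a * x + b) % P) % m

h = random_universal_hash(1 << 20)   # e.g. hash into M = 2^20 buckets
```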

  10. Simple Hash Functions Don't Always Work • ∃ pairwise independent hash families & inputs s.t. Linear Probing has Ω(log T) insertion time [PPR07]. • ∃ k-wise independent hash families & inputs s.t. Bloom Filter error prob. is higher than ideal [MV08]. • Open for Balanced Allocations. • Worst case does not match practice.

  11. Average-Case Analysis? • Data uniform & independent in [N]. • Not a good model for real data. • Trivializes hashing. • Need intermediate model between worst-case and average-case analysis.

  12. Our Model: Block Sources [CG85] • Data is a finite stream, modeled by a sequence of random variables X1, X2, …, XT ∈ [N]. • Each stream element has at least k bits of (Rényi) entropy conditioned on previous elements, i.e. H2(Xi | X1 = x1, …, Xi−1 = xi−1) ≥ k, where H2(X) = log(1/cp(X)) and cp(X) = Σ_x Pr[X=x]^2. • Similar in spirit to semi-random graphs [BS95], smoothed analysis [ST01].
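
A small helper, as a sketch, for the two quantities in this definition: the collision probability cp(X) and the corresponding Rényi entropy log(1/cp(X)), estimated here from an empirical sample (the estimation-from-samples step is an illustration, not part of the model):

```python
import math
from collections import Counter

def collision_probability(samples):
    """Empirical cp(X) = sum_x Pr[X = x]^2, estimated from a list of observed values."""
    n = len(samples)
    counts = Counter(samples)
    return sum((c / n) ** 2 for c in counts.values())

def renyi_entropy(samples):
    """Empirical Renyi (order-2) entropy: log2(1/cp(X)), in bits."""
    return -math.log2(collision_probability(samples))
```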

  13. An Approach • H truly random: for all distinct x1,…,xT, (H(x1),…,H(xT)) is uniform in [M]^T. • Goal: if H is a random universal hash function and X1, X2, …, XT is a block source, then (H(X1),…,H(XT)) is "close" to uniform. Randomness extractors!

  14. Classic Extractor Results [BBR88, ILL89, CG85, Z90] • Leftover Hash Lemma: If H : [N] → [M] is a random universal hash function and X has Rényi entropy at least log M + 2 log(1/ε), then (H, H(X)) is ε-close to uniform. • Thm: If H : [N] → [M] is a random universal hash function and X1, X2, …, XT is a block source with Rényi entropy at least log M + 2 log(T/ε) per block, then (H, H(X1),…,H(XT)) is ε-close to uniform.

  15. Sample Parameters • Network flows (IP addresses, ports, transport protocol): N = 2^104 • Number of items: T = 2^16 • Hash range (2 values per item): M = 2^32 • Entropy needed per item: 64 + 2 log(1/ε). • Can we do better?
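
A sketch of where the 64 + 2 log(1/ε) figure comes from, plugging these parameters into the theorem of slide 14; the target error ε = 2^-20 is an illustrative choice:

```python
import math

log_M = 32                 # hash range M = 2^32
log_T = 16                 # T = 2^16 items
eps = 2 ** -20             # illustrative target distance from uniform
# Per-block Renyi entropy needed: log M + 2*log(T/eps) = 64 + 2*log(1/eps)
need = log_M + 2 * (log_T + math.log2(1 / eps))
print(need)                # 104.0 bits -- essentially all of the 104 bits per network flow
```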

  16. Improved Bounds I Thm [CV08]: If H : [N] → [M] is a random universal hash function and X1, X2, …, XT is a block source with Rényi entropy at least log M + log T + 2 log(1/ε) + O(1) per block, then (H, H(X1),…,H(XT)) is ε-close to uniform. Tight up to the additive constant [CV08].

  17. Improved Bounds II Thm [MV08, CV08]: If H : [N] → [M] is a random universal hash function and X1, X2, …, XT is a block source with Rényi entropy at least log M + log T + log(1/ε) + O(1) per block, then (H, H(X1),…,H(XT)) is ε-close to a distribution with collision probability O(1/M^T). Tight up to the dependence on ε [CV08].

  18. Proof Ideas: Upper Bounds 1. Bound average conditional collision probs: cp(H(Xi) | H, H(X1),…,H(Xi−1)) ≤ 1/M + 1/2^k. 2a. Statistical closeness to uniform: inductively bound "Hellinger distance" from uniform. 2b. Close to small collision prob: by Markov, get (1/T)·Σi cp(H(Xi) | H = h, H(X1) = y1,…,H(Xi−1) = yi−1) ≤ 1/M + 1/(ε·2^k) w.p. 1 − ε over h, y1,…,yi−1.

  19. Proof Ideas: Lower Bounds • Lower bound for randomness extractors [RT97]: if k is not large enough, then ∃ X of min-entropy k s.t. h(X) is "far" from uniform for most h. • Take X1, X2, …, XT to be i.i.d. copies of X. • Show that the error accumulates, e.g. statistical distance grows by a factor of Ω(√T) [R04, CV08].

  20. Open Problems • Tightening the connection to practice. • How to estimate the relevant entropy of data streams? • Cryptographic hash functions (MD5, SHA-1)? • Other data models? • Block source data model. • Other uses, implications?
