1 / 31

Sublinear Algorihms for Big Data

Sublinear Algorihms for Big Data. Lecture 2. Grigory Yaroslavtsev http://grigory.us. Recap. (Markov) For every ( Chebyshev ) For every ( Chernoff ) Let be independent and identically distributed r.vs with range [0, c] and expectation . Then if and. Today. Approximate Median

zubin
Download Presentation

Sublinear Algorihms for Big Data

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. SublinearAlgorihms for Big Data Lecture 2 GrigoryYaroslavtsev http://grigory.us

  2. Recap • (Markov) For every • (Chebyshev) For every • (Chernoff) Let be independent and identically distributedr.vs with range [0, c] and expectation . Then if and

  3. Today • Approximate Median • Alon-Mathias-Szegedy Sampling • Frequency Moments • Distinct Elements • Count-Min

  4. Data Streams • Stream: elements from universe , e.g. • = frequency of in the stream = # of occurrences of value

  5. Approximate Median • (all distinct) and let • Problem: Find -approximate median, i.e. such that • Exercise: Can we approximate the value of the median with additive error in sublinear time? • Algorithm: Return the median of a sample of size taken from (with replacement).

  6. Approximate Median • Problem: Find -approximate median, i.e. such that • Algorithm: Return the median of a sample of size taken from (with replacement). • Claim: If then this algorithm gives -median with probability

  7. Approximate Median • Partition into 3 groups • Key fact: If less than elements from each of and are in sample then its median is in • Let if -th sample is in and 0 otherwise. • Let . By Chernoff, if • Same for + union bound error probability

  8. AMS Sampling • Problem: Estimate , for an arbitrary function with • Estimator: Sample ,where is sampled uniformly at random from and compute: Output: • Expectation:

  9. Frequency Moments • Define for • # number of distinct elements • # elements • = “Gini index”, “surprise index”

  10. Frequency Moments • Define for • Use AMS estimator with • Exercise: , where • Repeat times and take average . By Chernoff: • Taking gives

  11. Frequency Moments • Lemma: • Result: memory suffices for -approximation of • Question: What if we don’t know • Then we can use probabilistic guessing (similar to Morris’s algorithm), replacing with .

  12. Frequency Moments • Lemma: • Exercise:Hint: worst-case when . Use convexity of • Case 1:

  13. Frequency Moments • Lemma: • Case 2:

  14. Hash Functions • Definition: A family of functions from is -wise independent if for any distinct and • Example: If for prime is a -wise independent family of hash functions.

  15. Linear Sketches • Sketching algorithm: picks a random matrix where and computes • Can be incrementally updated: • We have a sketch • When arrives, new frequencies are • Updating the sketch: • Need to choose random matrices carefully

  16. Problem: -approximation for • Algorithm: • Let , where entries of each row are -wise independent and rows are independent • Don’t store the matrix: 4-wise independent hash functions • Compute average squared entries “appropriately” • Analysis: • Let be any entry of . • Lemma: • Lemma: Var

  17. Let be a row of with entries • We used 2-wise independence for .

  18. by 4-wise independence

  19. : Distinct Elements • Problem:-approximation for • Simplified: For fixed with prob. distinguish: vs. • Original problem reduces by trying values of T:

  20. : Distinct Elements • Simplified: For fixed with prob. distinguish: vs. • Algorithm: • Choose random sets where • Compute • If at least of the values are zero, output

  21. vs. • Algorithm: • Choose random sets where • Compute • If at least of the values are zero, output • Analysis: • If , then • If , then • Chernoff: gives correctness w.p.

  22. vs. • Analysis: • If , then • If , then • If is large and is small then: • If • If

  23. Count-Min Sketch • https://sites.google.com/site/countminsketch/ • Stream: elements from universe , e.g. • = frequency of in the stream = # of occurrences of value • Problems: • Point Query: For estimate • Range Query: For estimate • Quantile Query: For find with • Heavy Hitters: For find all with

  24. Count-Min Sketch: Construction • Let be -wise independent hash functions • We maintain counters with values: elements in the stream with • For every the value and so: • If and then: .

  25. Count-Min Sketch: Analysis • Define random variables such that • Define if and 0 otherwise: • By -wise independence: • By Markov inequality,

  26. Count-Min Sketch: Analysis • All are independent • The w.p. there exists such that • CountMin estimates values up to with total memory

  27. Dyadic Intervals • Define partitions of : … • Exercise: Any interval can be written as a disjoint union of at most such intervals. • Example: For

  28. Count-Min: Range Queries and Quantiles • Range Query: For estimate • Approximate median: Find such that: and

  29. Count-Min: Range Queries and Quantiles • Algorithm: Construct Count-Min sketches, one for each such that for any we have an estimate for such that: • To estimate let be decomposition: • Hence,

  30. Count-Min: Heavy Hitters • Heavy Hitters: For find all with but no elements with • Algorithm: • Consider binary tree whose leaves are [n] and associate internal nodes with intervals corresponding to descendant leaves • Compute Count-Min sketches for each • Level-by-level from root, mark children of marked nodes if • Return all marked leaves • Finds heavy-hitters in steps

  31. Thank you! • Questions?

More Related