
Compact Representations in Streaming Algorithms

Moses Charikar, Princeton University



  1. Compact Representations in Streaming Algorithms Moses Charikar, Princeton University

  2. Talk Outline • Statistical properties of data streams • Distinct elements • Frequency moments, norm estimation • Frequent items

  3. Frequency Moments [Alon, Matias, Szegedy ’99] • Stream consists of elements from {1,2,…,n} • mi = number of times i occurs • Frequency moment Fk = Σi mi^k • F0 = number of distinct elements • F1 = size of stream • F2 = Σi mi²
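To make the definitions concrete, here is an exact (linear-space) computation of the moments; the helper name is illustrative, not from the talk:

```python
from collections import Counter

def frequency_moment(stream, k):
    # Exact F_k = sum over distinct elements i of m_i ** k.
    counts = Counter(stream)
    if k == 0:
        return len(counts)  # F0: number of distinct elements
    return sum(m ** k for m in counts.values())
```

For the stream [1, 2, 2, 3, 3, 3]: F0 = 3, F1 = 6, F2 = 1 + 4 + 9 = 14. The streaming algorithms that follow approximate these quantities in far less than linear space.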

  4. Overall Scheme • Design estimator (i.e. random variable) with the right expectation • If estimator is tightly concentrated, maintain a number of independent copies of the estimator E1, E2, …, Er • Obtain estimate E from E1, E2, …, Er • Within (1+ε) with probability 1-δ

  5. Randomness • Design estimator assuming perfect hash functions, as much randomness as needed • Too much space required to explicitly store such a hash function • Fix later by showing that limited randomness suffices

  6. Distinct Elements • Estimate the number of distinct elements in a data stream • “Brute force” solution: maintain a list of distinct elements seen so far • Ω(n) storage • Can we do better?

  7. Distinct Elements [Flajolet, Martin ’83] • Pick a random hash function h:[n] → [0,1] • Say there are k distinct elements • Then the minimum value of h over the k distinct elements is around 1/k • Apply h() to every element of the data stream; maintain the minimum value • Estimator = 1/minimum

  8. (Idealized) Analysis • Assume a perfectly random hash function h:[n] → [0,1] • S: set of k elements of [n] • X = min a∈S { h(a) } • E[X] = 1/(k+1) • Var[X] = O(1/k²) • Mean of O(1/ε²) independent estimators is within (1+ε) of 1/k with constant probability
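A minimal sketch of the idealized scheme, simulating the perfectly random hash with a memoized PRNG and averaging the minima before inverting (one common way to combine the copies; names are illustrative):

```python
import random
import statistics

def distinct_estimate(stream, copies=200, seed=0):
    # Idealized Flajolet-Martin: simulate a random hash h:[n] -> [0,1]
    # per copy with memoized PRNG values; track the minimum per copy.
    hashes = [dict() for _ in range(copies)]
    mins = [1.0] * copies
    rng = random.Random(seed)
    for x in stream:
        for j in range(copies):
            if x not in hashes[j]:
                hashes[j][x] = rng.random()  # hash value fixed on first sight
            if hashes[j][x] < mins[j]:
                mins[j] = hashes[j][x]
    # E[min over k distinct elements] = 1/(k+1); invert the average.
    return 1.0 / statistics.mean(mins) - 1.0
```

Note that repeated occurrences of an element do not change the minimum, which is exactly why the estimator counts distinct elements.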

  9. Analysis • [Alon, Matias, Szegedy] Analysis goes through with a pairwise independent hash function h(x) = ax+b • Constant-factor approximation • O(log n) space • Many improvements [Bar-Yossef, Jayram, Kumar, Sivakumar, Trevisan]

  10. Estimating F2 • F2 = Σi mi² • “Brute force” solution: maintain counters for all distinct elements • Sampling? • Ω(n^(1/2)) space

  11. Estimating F2 [Alon, Matias, Szegedy] • Pick a random hash function h:[n] → {+1,-1} • hi = h(i) • Z = Σi mi hi • Z initially 0; add hi every time you see i • Estimator X = Z²
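A single copy of this estimator is a few lines of Python; here a seeded PRNG stands in for the 4-wise independent hash the talk uses later (a sketch, not the exact construction):

```python
import random

def ams_f2_single(stream, seed=0):
    # One AMS estimator: Z accumulates a random sign per element value;
    # X = Z**2 satisfies E[X] = F2.
    sign = {}
    z = 0
    for x in stream:
        if x not in sign:
            sign[x] = random.Random(f"{seed}:{x}").choice((-1, 1))
        z += sign[x]
    return z * z
```

For a stream containing a single value repeated m times, Z = ±m and the estimate is exactly m² = F2; for general streams, averaging independent copies (different seeds) drives the estimate toward F2.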

  12. Analyzing the F2 estimator • E[X] = E[Z²] = Σi mi² = F2, since the cross-terms mi mj E[hi hj] vanish by pairwise independence • Var[X] = O(F2²), bounded using 4-wise independence

  13. Analyzing the F2 estimator • Median of means gives good estimator
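The median-of-means combiner mentioned above is a generic trick; a minimal version (illustrative helper, not from the talk):

```python
import statistics

def median_of_means(samples, groups):
    # Average within each group (shrinks the variance), then take the
    # median across groups (boosts the success probability).
    size = len(samples) // groups
    means = [statistics.mean(samples[g * size:(g + 1) * size])
             for g in range(groups)]
    return statistics.median(means)
```

Averaging alone is sensitive to a single wild estimate; the median tolerates a minority of bad groups, which is what turns a constant success probability into a high-probability guarantee.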

  14. What about the randomness? • Analysis only requires 4-wise independence of the hash function h • Pick h from a 4-wise independent family • O(log n) space representation, efficient computation of h(i)
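A standard way to realize such a family is a random degree-3 polynomial over a prime field; taking the low bit of the field value gives near-unbiased ±1 signs. This is a sketch of the idea, not necessarily the talk's exact family:

```python
import random

class FourWiseSigns:
    # Signs from a random degree-3 polynomial over a prime field: the
    # family is 4-wise independent and needs only O(log n) space
    # (four coefficients) rather than a table of n random signs.
    def __init__(self, prime=2**31 - 1, seed=0):
        rng = random.Random(seed)
        self.p = prime
        self.coeffs = [rng.randrange(prime) for _ in range(4)]

    def __call__(self, x):
        v = 0
        for c in self.coeffs:  # Horner evaluation of the cubic mod p
            v = (v * x + c) % self.p
        return 1 if v & 1 == 0 else -1
```

Evaluation is four multiply-adds, so h(i) can be recomputed on every stream update instead of being stored.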

  15. Properties of the F2 estimator • “Sketch” of the data stream that allows computation of F2 • Linear function of the mi • Can be added, subtracted • Given two streams with frequencies mi, ni • E[(Z1-Z2)²] = Σi (mi - ni)² • Estimate the L2 norm of the difference • How about the L1 norm? Lp norm?
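Linearity is what makes sketch subtraction work: two sketches built with the same sign function can be subtracted, and the squared difference estimates the squared L2 distance. A minimal sketch (names illustrative; seeded PRNGs stand in for the hash family):

```python
import random
import statistics

def sign_sketch(freqs, seed):
    # Z = sum_i m_i * h(i); the sketch is linear in the frequencies.
    return sum(m * random.Random(f"{seed}:{i}").choice((-1, 1))
               for i, m in freqs.items())

def l2_diff_sq_estimate(freqs_a, freqs_b, copies=300):
    # Same seed => same h, so Z1 - Z2 sketches the difference stream and
    # E[(Z1 - Z2)**2] = sum_i (m_i - n_i)**2, the squared L2 distance.
    return statistics.mean(
        (sign_sketch(freqs_a, s) - sign_sketch(freqs_b, s)) ** 2
        for s in range(copies))
```

The two streams never need to be compared element by element; only their small sketches are exchanged.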

  16. Stable Distributions • p-Stable distribution D: If X1, X2, …, Xn are i.i.d. samples from D, then m1X1 + m2X2 + … + mnXn is distributed as ||(m1,m2,…,mn)||p · X, where X is itself a sample from D • Defining property, up to a scale factor • Gaussian distribution is 2-stable • Cauchy distribution is 1-stable • p-Stable distributions exist for all 0 < p ≤ 2
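The 1-stable case gives an L1 sketch: dot the frequency vector with i.i.d. Cauchy samples, so each dot product is distributed as ||m||1 times a standard Cauchy, whose absolute value has median 1. A sketch under those assumptions (the median-based recovery follows Indyk's stable-distribution approach; names are illustrative):

```python
import math
import random
import statistics

def l1_norm_estimate(freqs, copies=500, seed=0):
    # Each copy dots the frequency vector with fresh i.i.d. standard
    # Cauchy samples: dot ~ ||m||_1 * C with C standard Cauchy, and
    # median(|C|) = 1, so the median of |dot| estimates the L1 norm.
    rng = random.Random(seed)
    dots = []
    for _ in range(copies):
        dot = sum(m * math.tan(math.pi * (rng.random() - 0.5))
                  for m in freqs.values())
        dots.append(abs(dot))
    return statistics.median(dots)
```

The median (not the mean) is essential here: a Cauchy variable has no finite expectation, so averaging the dot products would not converge.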

  17. Talk Outline • Similarity preserving hash functions • Similarity estimation • Statistical properties of data streams • Distinct elements • Frequency moments, norm estimation • Frequent items

  18. Variants of the F2 estimator [Alon, Gibbons, Matias, Szegedy] • Estimate the join size of two relations with frequency vectors (m1,m2,…) and (n1,n2,…): Σi mi ni • Variance may be too high
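The same sign sketches estimate the join size via a product rather than a difference: with a shared sign function, E[Z_A · Z_B] = Σi mi ni. A sketch of this variant (illustrative names, seeded PRNG in place of the hash family):

```python
import random
import statistics

def join_size_estimate(freqs_a, freqs_b, copies=500):
    # With a shared sign function h per copy, E[Z_A * Z_B] = sum_i m_i n_i,
    # the join size; averaging copies tames the (possibly large) variance.
    def z(freqs, s):
        return sum(m * random.Random(f"{s}:{i}").choice((-1, 1))
                   for i, m in freqs.items())
    return statistics.mean(z(freqs_a, s) * z(freqs_b, s)
                           for s in range(copies))
```

The high-variance caveat on the slide is visible here: when the join size is small relative to the norms of the two frequency vectors, many copies are needed for a useful estimate.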

  19. Finding Frequent Items [C,Chen,Farach-Colton ’02] Goal: Given a data stream, return an approximate list of the k most frequent items in one pass and sub-linear space Applications: Analyzing search engine queries, network traffic.

  20. Finding Frequent Items ai: ith most frequent element mi: frequency If we had an oracle that gave us exact frequencies, we could find the most frequent items in one pass Solution: a data structure called a Count Sketch that gives good estimates of the frequencies of the high-frequency elements at every point in the stream

  21. Intuition • Consider a single counter X with a single hash function h: a → {+1, -1} • On seeing each element ai, update the counter: X += h(ai) • X = Σi mi · h(ai) • Claim: E[X · h(ai)] = mi • Proof idea: cross-terms cancel because of pairwise independence
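The single-counter claim can be checked empirically; averaging over many independent sign functions recovers the true frequency (sketch with illustrative names, seeded PRNG for the signs):

```python
import random
import statistics

def single_counter_estimate(stream, item, copies=400):
    # X = sum over the stream of h(a); X * h(item) is an unbiased
    # estimate of m_item: E[h(a) * h(item)] = 0 for a != item.
    ests = []
    for s in range(copies):
        def h(a):
            return random.Random(f"{s}:{a}").choice((-1, 1))
        x = sum(h(a) for a in stream)
        ests.append(x * h(item))
    return statistics.mean(ests)
```

A real streaming algorithm cannot afford many fresh passes like this; the next slides replace the repetition with arrays of counters maintained in a single pass.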

  22. Finding the max element • Problem with the single-counter scheme: variance is too high • Replace with an array of t counters, using independent hash functions h1, …, ht : a → {+1, -1}

  23. Analysis of the “array of counters” data structure • Expectation still correct • Claim: variance of the final estimate < Σi mi² / t • Variance of each estimate < Σi mi² • Proof idea: cross-terms cancel • Set t = O(log n · Σi mi² / m1²) to get the answer with high probability • Proof idea: “median of averages”

  24. Problem with the “array of counters” data structure • Variance of the estimator is dominated by the contribution of the large elements • Estimates for important elements such as ak are corrupted by larger elements (variance much more than mk²) • To avoid collisions, replace each counter with a hash table of b counters to spread out the large elements
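Putting the pieces together (t tables, b buckets each, median across tables) gives a compact version of the Count Sketch idea; this is a sketch of the structure, with seeded PRNGs standing in for the pairwise-independent hash families:

```python
import random
import statistics

class CountSketch:
    # t hash tables of b signed counters each; the estimate of m_i is
    # the median over tables of counter[bucket_j(i)] * sign_j(i).
    # Spreading elements over b buckets keeps large elements from
    # corrupting the estimates of smaller ones.
    def __init__(self, t=5, b=64, seed=0):
        self.t, self.b, self.seed = t, b, seed
        self.tables = [[0] * b for _ in range(t)]

    def _bucket(self, j, x):
        return random.Random(f"b{self.seed}:{j}:{x}").randrange(self.b)

    def _sign(self, j, x):
        return random.Random(f"s{self.seed}:{j}:{x}").choice((-1, 1))

    def add(self, x):
        for j in range(self.t):
            self.tables[j][self._bucket(j, x)] += self._sign(j, x)

    def estimate(self, x):
        return statistics.median(
            self.tables[j][self._bucket(j, x)] * self._sign(j, x)
            for j in range(self.t))
```

The structure uses t·b counters regardless of stream length, supports updates in O(t) time, and can be queried for any element's frequency at any point in the stream.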

  25. In Conclusion • Simple powerful ideas at the heart of several algorithmic techniques for large data sets • “Sketches” of data tailored to applications • Many interesting research questions
