Algorithms for data streams Lecture 2

Foundations of Data Science 2014, Indian Institute of Science. Navin Goyal.

Presentation Transcript


  1. Algorithms for data streams, Lecture 2. Foundations of Data Science 2014, Indian Institute of Science. Navin Goyal

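  2. Estimating $F_2$ using the AMS sketch • Given a turnstile stream with frequency vector $f = (f_1, \dots, f_n)$, estimate $F_2 = \sum_{i=1}^n f_i^2$ within multiplicative error $1 \pm \epsilon$ with probability at least $1 - \delta$ • Obvious solution takes space linear in $n$ (maintain the frequency vector). Can’t do better deterministically • Randomized algorithm [Alon-Matias-Szegedy ’96]: • Sample a random vector $x \in \{-1, 1\}^n$ with each coordinate chosen uniformly at random from $\{-1, 1\}$ independently • So if we could compute $x \cdot f = \sum_i f_i x_i$ then we could estimate $F_2$, since $E[(x \cdot f)^2] = F_2$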

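  3. Basic AMS algorithm for $F_2$ • Given a turnstile stream, estimate $F_2 = \sum_i f_i^2$ within multiplicative error $1 \pm \epsilon$ with probability at least $1 - \delta$. Basic AMS estimator: • Choose a random vector $x \in \{-1, 1\}^n$ • Initialize $Z \leftarrow 0$ • Until the end of the stream do • On arrival of element $(i, \Delta)$: $Z \leftarrow Z + \Delta \cdot x_i$ • At the end of the stream $Z = \sum_i f_i x_i$ • $Z^2$ is an estimator of $F_2$ • Problem: requires space linear in $n$ (to store $x$)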
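The basic estimator on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's code: the function name `basic_ams_estimate`, the 0-based item indices, and the toy stream are all assumptions made for the example. Averaging many independent runs shows empirically that $Z^2$ is centred on $F_2$.

```python
import random

def basic_ams_estimate(stream, n, rng):
    """One run of the basic AMS estimator for F_2 on a turnstile stream.

    stream: iterable of (i, delta) updates with 0 <= i < n (0-based here).
    rng: a random.Random instance that supplies the sign vector.
    """
    # Fully independent sign vector: this is the part that costs n bits.
    x = [rng.choice((-1, 1)) for _ in range(n)]
    z = 0
    for i, delta in stream:
        z += delta * x[i]          # Z <- Z + delta * x_i
    return z * z                   # Z^2 is the estimate of F_2


# Toy turnstile stream over n = 4 items (hypothetical data).
stream = [(0, 3), (1, 1), (2, 5), (0, -1), (3, 2), (2, -2)]
n = 4
freq = [0] * n
for i, delta in stream:
    freq[i] += delta
true_f2 = sum(f * f for f in freq)   # frequencies (2, 1, 3, 2) give F_2 = 18

rng = random.Random(0)
runs = [basic_ams_estimate(stream, n, rng) for _ in range(20000)]
mean_estimate = sum(runs) / len(runs)
print(true_f2, round(mean_estimate, 2))
```

A single run can be far from $F_2$; only the average over many runs is close, which is what motivates the variance analysis that follows.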

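  4. $Z^2$ is a reasonable estimator of $F_2$ • For $Z = \sum_i f_i x_i$: $E[Z^2] = F_2$ and $\mathrm{Var}[Z^2] \le 2 F_2^2$ • (proof on the board; also in the book) • Application of Chebyshev: $\Pr[|Z^2 - F_2| \ge \epsilon F_2] \le \mathrm{Var}[Z^2] / (\epsilon^2 F_2^2) \le 2/\epsilon^2$, which is too weak on its own • Can improve by the median of the means estimator: average $t = O(1/\epsilon^2)$ independent copies of $Z^2$ to get one mean, and repeat to get $k = O(\log(1/\delta))$ independent means $Y_1, \dots, Y_k$ • Output the median of $Y_1, \dots, Y_k$ • This gives an $(\epsilon, \delta)$-approximation of $F_2$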
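The median of the means step is independent of where the individual estimates come from, so it can be sketched on its own. In the snippet below, `median_of_means`, its parameters `t` and `k`, and the sample numbers are illustrative names and data, not taken from the slides; `draw` stands for any zero-argument function that returns one fresh estimate such as $Z^2$.

```python
import statistics

def median_of_means(draw, t, k):
    """Combine t * k independent estimates into one robust estimate.

    draw: zero-argument callable returning one fresh estimate.
    t: number of estimates averaged in each group (reduces variance).
    k: number of groups whose means are combined by a median
       (boosts the success probability).
    """
    means = []
    for _ in range(k):
        group = [draw() for _ in range(t)]
        means.append(statistics.mean(group))
    return statistics.median(means)


# Hypothetical estimates with one wild outlier (100) in the last group.
samples = iter([1, 2, 3, 4, 5, 6, 100, 0, 2])
result = median_of_means(lambda: next(samples), t=3, k=3)
print(result)  # group means are 2, 5, 34, so the median is 5
```

The outlier drags one group mean far off, but the median ignores it, which is why the number of groups only needs to grow like $\log(1/\delta)$.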

  5. The AMS sketch • How much space does the basic AMS sketch take (without the median of the means trick)? • The counter $Z$ fits in $O(\log m)$ bits for a stream of length $m$ (assuming the update magnitudes $|\Delta|$ are bounded by a constant) • So is $O(\log m)$ space sufficient? • No! • We also need to remember the random vector $x$ • And this requires $n$ bits • What essential property of the random vector did we use?

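  6. The AMS sketch • What essential property of the random vector $x$ did we use? • For $E[Z^2] = F_2$, we used $E[x_i x_j] = 0$ for all $i \ne j$ • For $\mathrm{Var}[Z^2] \le 2 F_2^2$, we additionally used $E[x_i x_j x_k x_l] = 0$ for all pairwise distinct $i, j, k, l$ • This is satisfied if the $x_i$ are 4-wise independent: For any pairwise distinct $i, j, k, l$ the random variables $x_i, x_j, x_k, x_l$ are mutually independent • For our situation, this means that for any pairwise distinct $i, j, k, l$ and any $a, b, c, d \in \{-1, 1\}$ we have $\Pr[x_i = a, x_j = b, x_k = c, x_l = d] = 1/16$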
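The two moment computations behind these bullets can be written out as follows. This is the standard derivation, filled in here for reference rather than transcribed from the board proof; it assumes $Z = \sum_i f_i x_i$ with $x_i \in \{-1, 1\}$, $E[x_i] = 0$, and 4-wise independence, and all sums over $i \ne j$ range over ordered pairs.

```latex
\begin{align*}
E[Z^2] &= \sum_{i} f_i^2\, E[x_i^2] + \sum_{i \ne j} f_i f_j\, E[x_i x_j]
        = \sum_i f_i^2 = F_2, \\
E[Z^4] &= \sum_i f_i^4 + 3 \sum_{i \ne j} f_i^2 f_j^2, \\
\mathrm{Var}[Z^2] &= E[Z^4] - F_2^2
        = 2 \sum_{i \ne j} f_i^2 f_j^2 \le 2 F_2^2 .
\end{align*}
```

In the expansion of $E[Z^4]$ every term containing some $x_i$ to an odd power vanishes, and that step is exactly where independence of up to four coordinates is used. The factor 3 counts the ways to split four positions into two pairs.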

  7. Constructing pairwise independent random bit vectors • Given a uniformly random vector $y \in \{0, 1\}^k$ ($k$ bits of perfect randomness) • We use $y$ to construct a pairwise independent random vector $x \in \{0, 1\}^{2^k - 1}$ ($2^k - 1$ bits of useful randomness) • We index $x$ by nonempty subsets $S$ of $\{1, \dots, k\}$ • For $S$ define $x_S = \bigoplus_{i \in S} y_i$ (sum mod 2). Claim: For distinct and nonempty $S, T$, the bits $x_S$ and $x_T$ are independent and uniformly distributed. Proof: On the board • The $x_S$ are not 3-wise independent
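For small $k$ the claim can be checked exhaustively by enumerating all $2^k$ seeds. The sketch below does this for $k = 3$; encoding subsets as bitmasks and the helper names `subset_xor_bits` and `is_pairwise_independent` are choices made for this example only.

```python
from itertools import combinations, product

def subset_xor_bits(y):
    """Map seed bits y (length k) to 2^k - 1 bits, one per nonempty subset S."""
    k = len(y)
    bits = {}
    for mask in range(1, 2 ** k):          # mask encodes a nonempty subset S
        b = 0
        for i in range(k):
            if mask >> i & 1:
                b ^= y[i]                  # x_S = XOR of y_i for i in S
        bits[mask] = b
    return bits

k = 3
# One derived vector per seed, for all 2^k seeds.
tables = [subset_xor_bits(y) for y in product((0, 1), repeat=k)]

def is_pairwise_independent(tables):
    masks = list(tables[0])
    for s, t in combinations(masks, 2):
        counts = {}
        for x in tables:
            counts[(x[s], x[t])] = counts.get((x[s], x[t]), 0) + 1
        # Each of the 4 bit patterns must occur for exactly 1/4 of the seeds.
        if sorted(counts.values()) != [len(tables) // 4] * 4:
            return False
    return True

pairwise_ok = is_pairwise_independent(tables)
# x_{1}, x_{2}, x_{1,2} are dependent: the third is the XOR of the first two,
# so only 4 of the 8 possible triples ever occur.
triple_patterns = {(x[0b001], x[0b010], x[0b011]) for x in tables}
print(pairwise_ok, len(triple_patterns))
```

The last two lines show concretely why the construction stops at pairwise independence: three of the derived bits are tied together by a linear relation.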

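  8. 2-wise independent hash function families • Very useful concept both in theory and practice • Let $U$ and $V$ be finite sets (the domain and the range) • A family $H$ of functions $h: U \to V$ is called 2-wise independent if for any distinct $u_1, u_2 \in U$, and any $v_1, v_2 \in V$, and for $h$ chosen uniformly at random from $H$, we have $\Pr[h(u_1) = v_1 \wedge h(u_2) = v_2] = 1/|V|^2$ (Also called 2-universal family) • The set of all functions $U \to V$ is 2-universal • It’s very large: $|V|^{|U|}$ functions, so describing one function takes $|U| \log_2 |V|$ bits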

  9. Pairwise independent random vectors and 2-wise independent hash functions • We say that a random vector $x = (x_1, \dots, x_n)$ is pairwise independent if for any distinct $i, j$ we have $x_i$ and $x_j$ are independent • A random hash function $h$ from a 2-wise independent hash function family of functions mapping $[n] \to V$ gives us a pairwise independent random vector: $x = (x_1, \dots, x_n)$ with $x_i = h(i)$ • Hash function language slightly more convenient in some situations • A non-streaming example of the utility of 2-wise independence: MAX CUT

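  10. Constructing 2-wise independent hash function families • There are much smaller 2-wise independent families than the family of all functions • Suppose $U = V = \mathbb{F}_p = \{0, 1, \dots, p - 1\}$ for a prime number $p$ • For $a, b \in \mathbb{F}_p$ define $h_{a,b}: \mathbb{F}_p \to \mathbb{F}_p$ by $h_{a,b}(u) = au + b \bmod p$ • Intuition: Determining a line in the plane requires two distinct points on the line • This gives a family $H = \{h_{a,b}\}$ of size $p^2$ • $H$ is 2-wise independent • Need $2 \lceil \log_2 p \rceil$ bits to store a function in $H$ • Evaluation of $h_{a,b}$ is constant time on RAM (or certainly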
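For a small prime the 2-wise independence of the linear family can be verified by brute force. The sketch below uses $p = 5$, chosen only to keep the enumeration cheap; `make_hash` and `is_two_wise_independent` are names introduced for this example.

```python
from itertools import product

p = 5  # small prime so that exhaustive checking is cheap (illustrative choice)

def make_hash(a, b, p):
    """Return the function u -> (a*u + b) mod p."""
    return lambda u: (a * u + b) % p

# The whole family: one function per pair (a, b), so p^2 functions.
family = [make_hash(a, b, p) for a, b in product(range(p), repeat=2)]

def is_two_wise_independent(family, p):
    for u1, u2 in product(range(p), repeat=2):
        if u1 == u2:
            continue
        for v1, v2 in product(range(p), repeat=2):
            hits = sum(1 for h in family if h(u1) == v1 and h(u2) == v2)
            # Need probability exactly 1/p^2 over a uniform choice of h.
            if hits * p * p != len(family):
                return False
    return True

print(len(family), is_two_wise_independent(family, p))
```

Exactly one function passes through any two prescribed points with distinct first coordinates, which is the "two points determine a line" intuition from the slide.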

  11. Constructing 2-wise independent hash function families using finite fields • More generally, we could take $U = V = \mathbb{F}_{2^r}$ for some positive integer $r$ • $\mathbb{F}_{2^r}$: the finite field with $2^r$ elements • The elements of $\mathbb{F}_{2^r}$ can be represented as bit vectors of length $r$ • The field provides a way to add and multiply the elements in time polynomial in $r$ • For $a, b \in \mathbb{F}_{2^r}$ (the finite field with $2^r$ elements) define $h_{a,b}$ by $h_{a,b}(u) = au + b$ • Need $2r$ bits to represent $h_{a,b}$

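  12. 2-wise independent hash function families • Can achieve $U = \mathbb{F}_{2^r}$ and $V = \{0, 1\}$: • Elements of $\mathbb{F}_{2^r}$ can be represented as $r$-tuples of bits • Represent $h_{a,b}(u)$ in this way: $h_{a,b}(u) = (y_1, \dots, y_r)$ • And define the new hash function by keeping just the first coordinate: $g_{a,b}(u) = y_1$. Claim: The functions $g_{a,b}$ above form a 2-wise independent hash function family. Proof: On the board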

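  13. $k$-wise independent hash function families • A family $H$ of functions $h: U \to V$ is called $k$-wise independent if for all distinct $u_1, \dots, u_k \in U$, and any $v_1, \dots, v_k \in V$, and for $h$ chosen uniformly at random from $H$, we have $\Pr[h(u_1) = v_1 \wedge \dots \wedge h(u_k) = v_k] = 1/|V|^k$ • The family of all functions is $k$-wise independent • There exist much smaller families obtained by generalizing the construction for pairwise independent hash families: • $U = V = \mathbb{F}$ with $\mathbb{F} = \mathbb{F}_{2^r}$ or $\mathbb{F} = \mathbb{F}_p$ ($p$ a prime number). For a $k$-tuple $a = (a_0, \dots, a_{k-1}) \in \mathbb{F}^k$ define $h_a$ by $h_a(u) = a_0 + a_1 u + \dots + a_{k-1} u^{k-1}$ • The above family is a $k$-wise independent family of size $|\mathbb{F}|^k$ • Intuition: A polynomial of degree at most $k - 1$ is fully specified by its values at $k$ points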
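The polynomial construction can also be checked exhaustively over a small prime field. The sketch below, with hypothetical helper names `poly_hash` and `is_t_wise_independent`, builds all polynomials with $k$ coefficients over $\mathbb{F}_p$ and tests whether their values at every set of $t$ distinct points are uniformly distributed.

```python
from collections import Counter
from itertools import combinations, product

def poly_hash(coeffs, u, p):
    """Evaluate a_0 + a_1 u + ... + a_{k-1} u^{k-1} mod p by Horner's rule."""
    acc = 0
    for a in reversed(coeffs):
        acc = (acc * u + a) % p
    return acc

def is_t_wise_independent(p, k, t):
    """Test t-wise independence of the family of polynomials with k coefficients."""
    # Value table of every polynomial with k coefficients over F_p.
    tables = [tuple(poly_hash(c, u, p) for u in range(p))
              for c in product(range(p), repeat=k)]
    for points in combinations(range(p), t):
        counts = Counter(tuple(tab[u] for u in points) for tab in tables)
        # All p^t value patterns must be hit, and equally often.
        if len(counts) != p ** t or len(set(counts.values())) != 1:
            return False
    return True

print(is_t_wise_independent(5, 3, 3))  # cubic-size family, 3 points
print(is_t_wise_independent(5, 2, 3))  # lines cannot be 3-wise independent
```

Checking unordered point sets is enough, since reordering the points only permutes the value patterns. The second call fails because a line is already determined by two of its values, so the third value is never free.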

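  14. Constructing 4-wise independent random -1/1-vector • Choose $r$ sufficiently large so that $2^r \ge n$ • Construct a 4-wise independent hash function family $H$ mapping $\mathbb{F}_{2^r} \to \mathbb{F}_{2^r}$ • Define $g_h: \mathbb{F}_{2^r} \to \{0, 1\}$ by letting $g_h(u)$ be the first coordinate of $h(u)$ • Functions $g_h$ form a 4-wise independent family • To generate a 4-wise independent random vector first choose a random $h \in H$ • The random vector is $(g_h(u_1), \dots, g_h(u_n))$ for $n$ distinct field elements $u_1, \dots, u_n$ • This is a $\{0, 1\}$-vector • To construct a $\{-1, 1\}$-vector map $0$ to $-1$ in the above vector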
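The recipe on this slide can be sketched and verified exhaustively in the smallest interesting case, $r = 3$ and $n = 8$. Several details below are assumptions made for the example: field elements are stored as integers whose bits are polynomial coefficients, the irreducible polynomial is $x^3 + x + 1$, and the "first coordinate" is taken to be the lowest-order bit (any fixed bit would work equally well).

```python
from collections import Counter
from itertools import combinations, product

R = 3
MODULUS = 0b1011   # x^3 + x + 1, irreducible over GF(2)

def gf_mul(a, b):
    """Multiply two elements of GF(2^R) (shift-and-add with reduction)."""
    result = 0
    while b:
        if b & 1:
            result ^= a        # addition in GF(2^R) is XOR
        b >>= 1
        a <<= 1
        if a >> R & 1:         # degree reached R: reduce by the modulus
            a ^= MODULUS
    return result

def sign(coeffs, u):
    """x_u in {-1, 1}: one bit of a random cubic polynomial evaluated at u."""
    acc = 0
    for a in reversed(coeffs):          # Horner's rule in GF(2^R)
        acc = gf_mul(acc, u) ^ a
    return 1 if acc & 1 else -1         # keep one coordinate, map 0 to -1

n = 2 ** R
# One sign vector per seed (a_0, a_1, a_2, a_3): 8^4 = 4096 vectors in total.
vectors = [tuple(sign(c, u) for u in range(n))
           for c in product(range(n), repeat=4)]

def is_four_wise_independent(vectors):
    for idx in combinations(range(len(vectors[0])), 4):
        counts = Counter(tuple(v[i] for i in idx) for v in vectors)
        # All 16 sign patterns must appear equally often over the seeds.
        if len(counts) != 16 or len(set(counts.values())) != 1:
            return False
    return True

print(len(vectors), is_four_wise_independent(vectors))
```

The seed is four field elements, so $4r$ bits describe a vector of length $2^r$, compared with $2^r$ bits for a fully independent one.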

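  15. Basic AMS algorithm for $F_2$. Basic AMS estimator with fully independent random vector: • Choose a random vector $x \in \{-1, 1\}^n$ • Initialize $Z \leftarrow 0$ • Until the end of the stream do • On arrival of element $(i, \Delta)$: $Z \leftarrow Z + \Delta \cdot x_i$. Basic AMS estimator with 4-wise independent random vector: • Choose a random vector $x$ implicitly, by choosing a random $h$ from a 4-wise independent hash function family • Initialize $Z \leftarrow 0$ • Until the end of the stream do • On arrival of element $(i, \Delta)$: compute $x_i$ from $h(i)$ and set $Z \leftarrow Z + \Delta \cdot x_i$ • $x_i$ can be evaluated in time polynomial in $\log n$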

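  16. Back to the AMS sketch • Generate $x$ using a 4-wise independent family of hash functions from $[n]$ to $\{-1, 1\}$ • Requires $O(\log n)$ space • Total space for the basic AMS sketch: $O(\log n + \log m)$ • Improve by the median of the means estimator: average $t = O(1/\epsilon^2)$ copies and take $k = O(\log(1/\delta))$ means $Y_1, \dots, Y_k$ • Output the median of $Y_1, \dots, Y_k$ • Total space used: $O(\epsilon^{-2} \log(1/\delta) (\log n + \log m))$ • ($(\epsilon, \delta)$-approximation)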
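Putting the pieces together, the full sketch keeps $t \cdot k$ counters, each with its own 4-element seed. The class below is a minimal sketch under explicit assumptions, not the lecture's implementation: the name `AMSSketch`, the field $\mathbb{F}_{2^8}$ with modulus $x^8 + x^4 + x^3 + x + 1$ (so item ids must be below 256), the use of the lowest-order bit as the sign, and the toy stream are all choices made for illustration.

```python
import random
import statistics

R = 8
MODULUS = 0x11B   # x^8 + x^4 + x^3 + x + 1, irreducible over GF(2)

def gf_mul(a, b):
    """Multiply two elements of GF(2^R) (shift-and-add with reduction)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> R & 1:
            a ^= MODULUS
    return result

class AMSSketch:
    def __init__(self, t, k, rng):
        self.t, self.k = t, k
        # One seed of 4 field elements (4 * R bits) per counter.
        self.seeds = [[rng.getrandbits(R) for _ in range(4)]
                      for _ in range(t * k)]
        self.z = [0] * (t * k)

    @staticmethod
    def _sign(seed, i):
        acc = 0
        for a in reversed(seed):            # cubic polynomial via Horner
            acc = gf_mul(acc, i) ^ a
        return 1 if acc & 1 else -1

    def update(self, i, delta):
        """Process one turnstile update (i, delta); requires 0 <= i < 2**R."""
        for c, seed in enumerate(self.seeds):
            self.z[c] += delta * self._sign(seed, i)

    def estimate(self):
        squares = [z * z for z in self.z]
        means = [statistics.mean(squares[g * self.t:(g + 1) * self.t])
                 for g in range(self.k)]
        return statistics.median(means)


# Hypothetical turnstile stream over 40 items, with two deletions at the end.
stream = [(i, i % 7 + 1) for i in range(40)] + [(3, -2), (10, -1)]
freq = {}
for i, delta in stream:
    freq[i] = freq.get(i, 0) + delta
true_f2 = sum(f * f for f in freq.values())

sketch = AMSSketch(t=100, k=5, rng=random.Random(2014))
for i, delta in stream:
    sketch.update(i, delta)
estimate = sketch.estimate()
print(true_f2, estimate)
```

Each update touches every counter, so the update time here is proportional to $t \cdot k$; the memory is $t \cdot k$ counters plus $t \cdot k$ short seeds, with no dependence on $n$ beyond the seed length.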

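  17. AMS sketch is linear • The algorithm maintains $Sf$, where the rows of the matrix $S$ are the random sign vectors and $f$ is the frequency vector • Corollary: Given two streams $\sigma_1$ and $\sigma_2$ with frequency vectors $f^{(1)}$ and $f^{(2)}$, we can get the sketch for their concatenation from their sketches by adding them: $S(f^{(1)} + f^{(2)}) = Sf^{(1)} + Sf^{(2)}$ • Geometric interpretation of the AMS sketch: Similar to the Johnson-Lindenstrauss projection trick that preserves the length • Works in the turnstile model because of the linearity of the AMS sketch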
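Linearity is easy to see in code: with the sign vectors held fixed, sketching two streams separately and adding the counters gives exactly the sketch of the concatenated stream. The snippet below is a small illustration with made-up streams; it uses fully random signs for brevity, since the identity does not depend on how the signs were generated.

```python
import random

def sketch(stream, signs):
    """Return the counter vector S f for a turnstile stream of (i, delta) pairs."""
    z = [0] * len(signs)
    for i, delta in stream:
        for row, x in enumerate(signs):
            z[row] += delta * x[i]
    return z

rng = random.Random(1)
n, rows = 6, 4
# Both sketches must share the same sign vectors (the same matrix S).
signs = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(rows)]

stream_a = [(0, 2), (3, 1), (5, 4)]
stream_b = [(3, -1), (1, 7), (0, 1)]

sum_of_sketches = [a + b for a, b in
                   zip(sketch(stream_a, signs), sketch(stream_b, signs))]
sketch_of_concat = sketch(stream_a + stream_b, signs)
print(sum_of_sketches == sketch_of_concat)
```

The same identity with a negative sign lets one sketch the difference of two streams, which is why deletions in the turnstile model need no special handling.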

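  18. Other $F_p$’s, where $F_p = \sum_i |f_i|^p$ • For $0 < p \le 2$, algorithms with space polylogarithmic in $n$ [Indyk 2000] and later improvements (nearly tight) • For $p > 2$ the problem becomes hard: space $\Omega(n^{1 - 2/p})$ is required (nearly tight)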
