
Maintaining Stream Statistics Over Sliding Windows

Paper by Mayur Datar, Aristides Gionis, Piotr Indyk, Rajeev Motwani. Presentation by Adam Morrison.


Presentation Transcript


  1. Maintaining Stream Statistics Over Sliding Windows Paper by Mayur Datar, Aristides Gionis, Piotr Indyk, Rajeev Motwani Presentation by Adam Morrison.

  2. Sliding Window Intro • Infinite stream. • Only last N elements relevant. • Packet streams. • N is huge. • Stronger model…

  3. Model • Count memory bits. • Online algorithm. [Figure: stream elements labeled with arrival times and timestamps.]

  4. Plan • Basic Counting • Given a bit stream, maintain at every time instant the count of 1s in the last N elements. • Sum • Given an integer stream, maintain the sum of the last N elements. • Everything else

  5. Basic Counting • Exact Solution? (Counter?) • Exact solution requires Θ(N) bits.
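
To make the cost of the exact approach concrete, here is a minimal Python sketch, assuming the whole window is simply stored. The class name `ExactWindowCounter` is illustrative and not taken from the paper or the slides. It keeps all N bits in a ring buffer, which is the linear memory cost this slide points out.

```python
class ExactWindowCounter:
    """Exact count of 1s in the last n bits; stores the whole window (n bits)."""

    def __init__(self, n):
        self.n = n
        self.window = [0] * n   # ring buffer holding the last n bits
        self.pos = 0            # slot that the next arrival overwrites
        self.count = 0          # running count of 1s in the window

    def add(self, bit):
        # New bit enters the window, the oldest stored bit leaves it.
        self.count += bit - self.window[self.pos]
        self.window[self.pos] = bit
        self.pos = (self.pos + 1) % self.n


counter = ExactWindowCounter(4)
for b in [1, 1, 0, 1, 1, 0]:
    counter.add(b)
print(counter.count)  # the last four bits are 0, 1, 1, 0, so this prints 2
```

The running counter alone is not enough: without the stored bits there is no way to know whether the element leaving the window was a 0 or a 1.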

  6. =0.05 95 100 105 Approximate Basic Counting • Solution: Approximate the answer and bound the relative error

  7. The idea • Dynamic histogram of active 1s. • New 1s go into rightmost bucket. • For each bucket keep the timestamp of the most recent 1 and the bucket’s size. • When timestamp expires, free bucket. • Open questions: Bucket sizes? Policy for creating new buckets? What is it good for?

  8. Example (N=4) [Figure: buckets shown with their timestamps and sizes.]

  9. (Timestamps are easy) • Cyclic counter mod N. [Figure: example with N=15.]

  10. What does the histogram buy us? • Active bucket: a bucket that contains an active 1. • Only the last (oldest) bucket might contain expired 1s.

  11. Estimating number of 1s • T – sum of all bucket sizes but last. • So there are at least T 1s. • C – size of last bucket. • Actual # of active 1s in the last bucket can be anything from 1 to C. • Conclusion: estimate T + C/2; the absolute error is at most C/2.
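
The estimate from T and C can be sketched in a few lines of Python. This is a minimal illustration, assuming the bucket sizes are given newest first; the function name `estimate_ones` is hypothetical.

```python
def estimate_ones(bucket_sizes):
    """Estimate the number of active 1s from bucket sizes listed newest first."""
    if not bucket_sizes:
        return 0.0
    *recent, last = bucket_sizes
    t = sum(recent)   # T: every 1 in these buckets is still inside the window
    c = last          # C: between 1 and C of the last bucket's 1s are active
    # The true count lies in [T + 1, T + C]; T + C/2 is off by at most C/2.
    return t + c / 2


print(estimate_ones([1, 1, 2, 4]))  # T = 4, C = 4, so this prints 6.0
```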

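  12. • Bucket sizes: $C_1, \dots, C_m$, most recent first, so $C = C_m$ and $T = \sum_{i=1}^{m-1} C_i$. • True count: at least $1 + \sum_{i=1}^{m-1} C_i$. • Absolute error: at most $C_m / 2$. • Relative error: at most $\frac{C_m}{2\,(1 + \sum_{i=1}^{m-1} C_i)}$.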

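  13. Bounding the error • Goal: Relative error at most ε = 1/k. • This holds if at all times, for all j, $\frac{C_j}{2\,(1 + \sum_{i=1}^{j-1} C_i)} \le \frac{1}{k}$, where $C_j$ is the size of the j-th most recent bucket.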

  14. Exponential Histogram • How can we do that? (With as few buckets as possible?) • Non-decreasing bucket sizes. • Bucket sizes constrained to powers of two, $\{1, 2, 4, \dots\}$. • At most $k/2 + 1$ buckets of each size. • For all sizes but that of last bucket, at least $k/2$ buckets of each size.

  15. • New 1 – create bucket. • Check if invariant violated. • Too many buckets – merge. [Figure: bucket sizes and timestamps changing as new 1s arrive and buckets merge.]
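
The create, check, and merge steps can be sketched as follows. This is a hedged Python illustration rather than the paper's code: the class and method names are made up, the cap of k/2 + 1 buckets per size is an assumption about the invariant, k is assumed even, and absolute arrival times are used instead of a cyclic counter mod N to keep the expiry test simple.

```python
from collections import deque


class ExponentialHistogram:
    """Approximate count of 1s in the last n bits (illustrative sketch)."""

    def __init__(self, n, k):
        self.n = n                      # window size N
        self.max_per_size = k // 2 + 1  # assumed cap on buckets of one size
        self.time = 0                   # absolute arrival time (no mod-N wraparound)
        self.buckets = deque()          # (timestamp of most recent 1, size), newest first

    def add(self, bit):
        self.time += 1
        # Free the oldest bucket once its most recent 1 has left the window.
        if self.buckets and self.buckets[-1][0] <= self.time - self.n:
            self.buckets.pop()
        if bit == 0:
            return
        self.buckets.appendleft((self.time, 1))  # a new 1 gets its own bucket
        self._merge()

    def _merge(self):
        # Cascade: while some size has too many buckets, merge its two oldest.
        items = list(self.buckets)
        i = 0
        while i < len(items):
            j = i
            while j < len(items) and items[j][1] == items[i][1]:
                j += 1                  # items[i:j] all share one size
            if j - i > self.max_per_size:
                newer, older = items[j - 2], items[j - 1]
                # The merged bucket keeps the newer timestamp and doubles in size.
                items[j - 2:j] = [(newer[0], newer[1] + older[1])]
                i = j - 2               # re-check the merged bucket's size class
            else:
                i = j
        self.buckets = deque(items)

    def estimate(self):
        if not self.buckets:
            return 0.0
        total = sum(size for _, size in self.buckets)
        return total - self.buckets[-1][1] / 2   # same as T + C/2


eh = ExponentialHistogram(n=8, k=4)
for bit in [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]:
    eh.add(bit)
print(list(eh.buckets), eh.estimate())
```

At most one bucket can expire per arrival because bucket timestamps are distinct, which is why a single expiry check per step is enough in this sketch.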

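  16. Why it works (correctness) • If there are at least $k/2$ buckets of each of the sizes $1, 2, \dots, 2^{r-1}$, then a bucket of size $2^r$ is preceded by at least $\frac{k}{2}(2^r - 1)$ more recent 1s, so half its size is roughly a $1/k$ fraction of the count.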

  17. Why it works (space) • Can account for all 1s with just O((1/ε) log N) buckets, because bucket sizes grow exponentially.

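  18. Space usage • # of buckets: O((1/ε) log N). • Bucket size: O(log log N) bits (sizes are powers of two, so store the exponent). • T counter for estimation: O(log N) bits.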

  19. Operations • Estimation: O(1) • Insertion: Cascading makes it O(log N) worst case. • But only O(1) amortized! • Bucket of size B accounts for all operations related to it: B inserts, B-1 merges (& maybe delete). • Sum of the sizes of all buckets over the lifetime (including deleted ones) is the number of all past insertions.

  20. Plan • Basic Counting • Given a bit stream, maintain at every time instant the count of 1s in the last N elements. • Sum • Given an integer stream, maintain the sum of the last N elements. • Everything else

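  21. Extending to Sum • Integers in range [0, R]. • On value V, insert V 1s. • Timestamps: O(log N) bits. • Bucket counter: O(log(log N + log R)) bits (the exponent of a size up to NR). • # of buckets: O((1/ε)(log N + log R)). • Total space: # of buckets times the bits per bucket. • Insertion takes Θ(R) time in the worst case!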

  22. Reducing insertion time • If we had a way to rebuild the entire histogram… • We could buffer new values… • And rebuild histogram when buffer reaches size B. • If a rebuild takes time $t$, the amortized cost per insertion is $O(1 + t/B)$. • Picking B large enough relative to the rebuild time gives a low amortized time.

  23. k/2 canonical representation • The k/2 canonical representation of S: $S = \sum_{i=0}^{j} k_i 2^i$, with $k_i \in \{k/2,\, k/2 + 1\}$ for $i < j$ and $1 \le k_j \le k/2 + 1$. • If S is the total size of the buckets, computing its k/2 canonical representation would help us rebuild the histogram. • Would it really? Is this representation unique?

  24. • Find the largest j for which … • If …, find … • Total time required is O(log S). [Figure: worked example with j=2.]
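
One way to compute such a representation in O(log S) steps is sketched below in Python. The exact conditions on the slide did not survive, so this is a reconstruction under an assumed definition, not the slide's procedure: every size below the largest gets k/2 or k/2 + 1 buckets, and the largest size gets between 1 and k/2 + 1. The function name and the parameter `half_k` (standing for k/2, assumed to be a positive integer) are hypothetical.

```python
def canonical_representation(s, half_k):
    """Return counts, where counts[i] is the number of buckets of size 2**i.

    Assumed definition: counts[i] is half_k or half_k + 1 for every size
    below the largest, and the largest size has 1 to half_k + 1 buckets.
    """
    if s <= 0:
        return []
    # Largest j such that half_k buckets of every smaller size, plus one
    # bucket of size 2**j, still fit into s.
    j = 0
    while half_k * (2 ** (j + 1) - 1) + 2 ** (j + 1) <= s:
        j += 1
    rest = s - half_k * (2 ** j - 1)   # what remains after the mandatory buckets
    top, low = divmod(rest, 2 ** j)    # top buckets of size 2**j, low < 2**j
    # Each set bit of low adds one extra bucket of the matching smaller size.
    counts = [half_k + ((low >> i) & 1) for i in range(j)]
    counts.append(top)
    return counts


print(canonical_representation(20, 2))  # [2, 3, 3]: 2*1 + 3*2 + 3*4 = 20
```

The loop runs at most about log S times and the remaining work is a single pass over the binary digits of the remainder, which is consistent with the O(log S) total stated on the slide.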

  25. • If a value gets “unindexed”, it will never be indexed in the future. • Calculate S1+S2 representation. [Figure: worked example with numbered values.]

  26. Plan • Basic Counting • Given a bit stream, maintain at every time instant the count of 1s in the last N elements. • Sum • Given an integer stream, maintain the sum of the last N elements. • Everything else: lower bounds, more about timestamps, applications, more problems.

  27. Lower bounds • Basic Counting and Sum algorithms are optimal. • Similar techniques will show that lots of other problems are intractable. (Later.)

  28. Basic Counting bound [Figure: a window of size N.]

  29. [Figure: lower-bound construction with labels “Big block d” and “Leftmost such subblock”.] • Same idea works for Sum.

  30. Randomized bound • Yao minimax principle: • Expected space complexity of the optimal deterministic algorithm for an input distribution is a lower bound on the expected space complexity of any randomized algorithm. • Lower bound applies to randomized algorithms.

  31. Timestamps • If far fewer than N items can arrive during the window, memory usage is reduced. • Define window based on real time – equate timestamp with clock. • No work needs to be done when items don’t arrive, so deletions can be deferred.

  32. Applications • Adapting algorithms to the sliding window model using EH to replace counters. • Counters require O(log N) bits, EH takes O((1/ε) log² N) bits. • Also a (1 + ε) factor loss in accuracy.

  33. More Problems • Min/Max • Storing subsequence of (say) mins is optimal. • Distinct values • Basic Counting reduces to it.
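
For Min/Max, one common way to store such a subsequence is a monotonic deque, sketched below in Python. This is an assumption about what the slide hints at, not something the transcript spells out, and the class name `SlidingMin` is illustrative. Only values that could still become the window minimum are kept.

```python
from collections import deque


class SlidingMin:
    """Minimum of the last n values, keeping only candidates that can still win."""

    def __init__(self, n):
        self.n = n
        self.time = 0
        self.candidates = deque()   # (arrival time, value), values strictly increasing

    def add(self, value):
        self.time += 1
        # A stored value >= the new one can never be the minimum again.
        while self.candidates and self.candidates[-1][1] >= value:
            self.candidates.pop()
        self.candidates.append((self.time, value))
        # Drop the front candidate once it falls out of the window.
        if self.candidates[0][0] <= self.time - self.n:
            self.candidates.popleft()

    def minimum(self):
        return self.candidates[0][1]


window = SlidingMin(3)
for v in [5, 3, 4, 6]:
    window.add(v)
print(window.minimum())  # window is 3, 4, 6, so this prints 3
window.add(7)
print(window.minimum())  # window is 4, 6, 7, so this prints 4
```

On an increasing stream the deque holds all N values of the window, so this structure needs linear space in the worst case.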

  34. Other Problems • Distinct values with deletions. • Factor 2 estimation requires Ω(N) space. • Map 1s in a bit string to distinct values. Pad with zeros to infer value of last bit, then use deletion to cancel that bit. • Repeat.

  35. Other Problems • Sum with negative integers. • Factor 2 estimation requires Ω(N) space. • Map 1s in a bit string to (-1,1) and 0s to (1,-1). • Pad with 0s and query at odd time instants.

  36. END
