
Sampling From a Moving Window Over Streaming Data


Presentation Transcript


  1. Sampling From a Moving Window Over Streaming Data • Brian Babcock*, Mayur Datar, Rajeev Motwani • Stanford University • *Speaker

  2. Continuous Data Streams • Data streams arise in a number of applications • IP packets in a network • Call records (telecom) • Cash register data (retail sales) • Sensor networks • Large volumes of data • Online processing • Data is read once and discarded • Memory is limited

  3. Why Moving Windows? • Timeliness matters • Old/obsolete data is not useful • Scalability matters • Querying the entire history may be impractical • Solution: restrict queries to a window of recent data • As new data arrives, old data “expires” • Addresses timeliness and scalability

  4. Two Types of Windows • Sequence-Based • The most recent n elements from the data stream • Assumes a (possibly implicit) sequence number for each element • Timestamp-Based • All elements from the data stream in the last m units of time (e.g. last 1 week) • Assumes a (possibly implicit) arrival timestamp for each element • Sequence-based is the focus for most of the talk

  5. Sampling From a Data Stream • Inputs: • Sample size k • Window size n >> k (alternatively, time duration m) • Stream of data elements that arrive online • Output: • k elements chosen uniformly at random from the last n elements (alternatively, from all elements that have arrived in the last m time units) • Goal: • maintain a data structure that can produce the desired output at any time upon request

  6. A Simple, Unsatisfying Approach • Choose a random subset X = {x1, …, xk}, X ⊆ {0, 1, …, n-1} • The sample always consists of the non-expired elements whose indexes are equal to x1, …, xk (modulo n) • Only uses O(k) memory • Technically produces a uniform random sample of each window, but unsatisfying because the sample is highly periodic • Unsuitable for many real applications, particularly those with periodicity in the data
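A minimal Python sketch of this fixed-residue scheme (class and method names are my own; the slides give no code):

    import random

    class PeriodicSample:
        # Keep the latest element seen at each of k fixed random
        # residues modulo n. O(k) memory, but the sampled positions
        # repeat identically in every window of n elements.
        def __init__(self, k, n):
            self.n = n
            self.residues = set(random.sample(range(n), k))  # X, a subset of {0,...,n-1}
            self.latest = {}  # residue -> most recent matching element

        def insert(self, i, element):
            r = i % self.n
            if r in self.residues:
                self.latest[r] = element  # new arrival replaces the expired one

        def get_sample(self):
            return list(self.latest.values())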

  7. Another Simple Approach: Oversample • As each element arrives, remember it with probability p = (c·k·log n)/n; otherwise discard it • Discard elements when they expire • When asked to produce a sample, choose k elements at random from the set in memory • Expected memory usage of O(k log n) • Uses O(k log n) memory with high probability (whp) • The algorithm can fail if fewer than k elements from a window are remembered; however, whp this will not happen
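A sketch of the oversampling approach, assuming c = 2 for concreteness (the slides leave the constant c unspecified):

    import math
    import random
    from collections import deque

    class OverSample:
        def __init__(self, k, n, c=2):
            self.k, self.n = k, n
            # Remember each arrival independently with p = (c*k*log n)/n.
            self.p = min(1.0, c * k * math.log(n) / n)
            self.kept = deque()  # (index, element) pairs, oldest first

        def insert(self, i, element):
            if random.random() < self.p:
                self.kept.append((i, element))
            while self.kept and self.kept[0][0] <= i - self.n:
                self.kept.popleft()  # drop expired elements

        def get_sample(self):
            # Raises ValueError if fewer than k elements survived,
            # which whp does not happen.
            return random.sample([x for _, x in self.kept], self.k)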

  8. Reservoir Sampling • Classic online algorithm due to Vitter (1985) • Maintains a fixed-size uniform random sample • Size of the data stream need not be known in advance • Data structure: “reservoir” of k data elements • As the ith data element arrives: • Add it to the reservoir with probability p = k/i, discarding a randomly chosen data element from the reservoir to make room • Otherwise (with probability 1-p) discard it
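The classic algorithm, sketched in Python:

    import random

    def reservoir_sample(stream, k):
        reservoir = []
        for i, x in enumerate(stream, start=1):
            if i <= k:
                reservoir.append(x)        # fill the reservoir first
            else:
                j = random.randrange(i)    # with probability k/i ...
                if j < k:
                    reservoir[j] = x       # ... evict a random slot
        return reservoir

At every point in the stream, each element seen so far is in the reservoir with equal probability k/i.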

  9. Why It Doesn’t Work With Moving Windows • Suppose an element in the reservoir expires • Need to replace it with a randomly-chosen element from the current window • However, in the data stream model we have no access to past data • Could store the entire window but this would require O(n) memory

  10. Chain-Sample • Include each new element in the sample with probability 1/min(i,n) • As each element is added to the sample, choose the index of the element that will replace it when it expires • When the ith element expires, the window will be (i+1…i+n), so choose the index from this range • Once the element with that index arrives, store it and choose the index that will replace it in turn, building a “chain” of potential replacements • When an element is chosen to be discarded from the sample, discard its “chain” as well
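A sketch of Chain-Sample for k = 1, using 0-based indexes (so the acceptance probability becomes 1/min(i+1, n)); for k > 1, one can run k independent copies:

    import random
    from collections import deque

    class ChainSample:
        def __init__(self, n):
            self.n = n
            self.chain = deque()    # (index, element) pairs; the head is the sample
            self.next_index = None  # index that will extend the chain's tail

        def insert(self, i, element):
            if random.random() < 1.0 / min(i + 1, self.n):
                # New sample chosen: discard the old chain entirely.
                self.chain = deque([(i, element)])
                # When element i expires, the window will be (i+1 ... i+n),
                # so pick its eventual replacement from that range.
                self.next_index = random.randint(i + 1, i + self.n)
            elif i == self.next_index:
                # The chosen replacement arrived: extend the chain and
                # pick the index that will replace it in turn.
                self.chain.append((i, element))
                self.next_index = random.randint(i + 1, i + self.n)
            if self.chain and self.chain[0][0] <= i - self.n:
                self.chain.popleft()  # head expired; the next link takes over

        def get_sample(self):
            return self.chain[0][1] if self.chain else None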

  11. 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3 3 5 1 4 6 2 8 5 2 35 4 2 2 5 0 9 8 4 6 7 3 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3 Example 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3

  12. Memory Usage of Chain-Sample • Let T(x) denote the expected length of the chain from the element with index i when the most recent index is i+x • T(x) = 0 for x ≤ 0, and T(x) = 1 + (1/n) Σ_{j<x} T(j) for x ≥ 1 • The expected length of each chain is less than T(n) ≤ e ≈ 2.718 • Expected memory usage is O(k)
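The recurrence gives T(x+1) = T(x)(1 + 1/n), so T(x) = (1 + 1/n)^(x-1) and T(n) = (1 + 1/n)^(n-1) < e. A quick numerical check:

    def expected_chain_length(n):
        # T(x) = 0 for x <= 0; T(x) = 1 + (1/n) * sum of T(j) for j < x.
        running_sum = 0.0
        T = 0.0
        for x in range(1, n + 1):
            T = 1.0 + running_sum / n
            running_sum += T
        return T

    print(expected_chain_length(1000))  # about 2.714, approaching e from below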

  13. Memory Usage of Chain-Sample • A chain consists of “hops” with lengths 1…n • A chain of length ≥ j can be represented by a partition of n into j ordered integer parts: j-1 hops with sum less than n, plus a remainder • Each such partition has probability n^(-j) • The number of such partitions is C(n, j) < (ne/j)^j • The probability of any such partition is small [O(n^(-c))] when j = O(k log n) • Uses O(k log n) memory whp

  14. Comparison of Algorithms • Chain-sample is preferable to oversampling: • Better expected memory usage: O(k) vs. O(k log n) • Same high-probability memory bound of O(k log n) • No chance of failure due to sample size shrinking below k

  15. Timestamp-Based Windows • Window at time t consists of all elements whose arrival timestamp is at least t’ = t-m • The number of elements in the window is not known in advance and may vary over time • None of the previous algorithms will work • All require windows with a constant, known number of elements

  16. Priority-Sample • We describe priority-sample for k=1 • Assign each element a randomly-chosen “priority” • The element with the highest priority is the sample • An element is ineligible if there is another element with a later timestamp and higher priority • Only store eligible, non-expired elements
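A sketch of Priority-Sample for k = 1 (for k samples, one can keep k independent copies). The eligible elements form a list with increasing timestamps and decreasing priorities, so dominated entries are popped from the tail as each element arrives:

    import random
    from collections import deque

    class PrioritySample:
        def __init__(self, m):
            self.m = m               # window duration in time units
            self.eligible = deque()  # (timestamp, priority, element)

        def insert(self, t, element):
            p = random.random()
            # Any stored element with a lower priority than this later
            # arrival is now ineligible and can be discarded.
            while self.eligible and self.eligible[-1][1] < p:
                self.eligible.pop()
            self.eligible.append((t, p, element))

        def get_sample(self, now):
            # Drop expired elements; the front is then the non-expired
            # element with the highest priority, i.e. the sample.
            while self.eligible and self.eligible[0][0] <= now - self.m:
                self.eligible.popleft()
            return self.eligible[0][2] if self.eligible else None

Note that discarding ineligible elements is safe: if a dominated element is still in the window, so is the later, higher-priority element that dominates it.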

  17. Memory Usage of Priority-Sample • Imagine that the elements were stored in a “treap” totally ordered by arrival timestamp and heap-ordered by priority • The eligible elements would represent the right spine of the treap • We only store the eligible elements • Therefore expected memory usage is O(log n), or O(k log n) for samples of size k • O(k log n) is also an upper bound (whp)

  18. Conclusion • Our contributions: • Introduced the problem of maintaining a sample over a moving window from a data stream • Developed the Chain-Sample algorithm for this problem with sequence-based windows • Developed the Priority-Sample algorithm for this problem with timestamp-based windows • Future work: • What else can be computed in sublinear space over moving windows on data streams? • For example: The next talk!
