Sampling From a Moving Window Over Streaming Data
Brian Babcock* Mayur Datar Rajeev Motwani
Stanford University
*Speaker
Continuous Data Streams
• Data streams arise in a number of applications
  • IP packets in a network
  • Call records (telecom)
  • Cash register data (retail sales)
  • Sensor networks
• Large volumes of data
• Online processing
• Data is read once and discarded
• Memory is limited
Why Moving Windows?
• Timeliness matters
  • Old/obsolete data is not useful
• Scalability matters
  • Querying the entire history may be impractical
• Solution: restrict queries to a window of recent data
  • As new data arrives, old data “expires”
  • Addresses both timeliness and scalability
Two Types of Windows
• Sequence-based
  • The most recent n elements from the data stream
  • Assumes a (possibly implicit) sequence number for each element
• Timestamp-based
  • All elements from the data stream in the last m units of time (e.g., the last week)
  • Assumes a (possibly implicit) arrival timestamp for each element
• Sequence-based windows are the focus for most of the talk
Sampling From a Data Stream
• Inputs:
  • Sample size k
  • Window size n >> k (alternatively, time duration m)
  • Stream of data elements that arrive online
• Output:
  • k elements chosen uniformly at random from the last n elements (alternatively, from all elements that arrived in the last m time units)
• Goal:
  • Maintain a data structure that can produce the desired output at any time upon request
A Simple, Unsatisfying Approach
• Choose a random subset X = {x1, …, xk}, X ⊆ {0, 1, …, n−1}
• The sample always consists of the non-expired elements whose indexes are equal to x1, …, xk (modulo n) — see the sketch below
• Only uses O(k) memory
• Technically produces a uniform random sample of each window, but unsatisfying because the sample is highly periodic
• Unsuitable for many real applications, particularly those with periodicity in the data
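A minimal sketch of this periodic scheme, assuming 0-based element indices; the function and variable names (`periodic_window_sample`, `offsets`) are illustrative, not from the talk:

```python
import random

def periodic_window_sample(stream, n, k):
    """Keep the elements whose stream index is congruent (mod n) to one of
    k fixed random offsets; the retained, non-expired elements form the sample."""
    offsets = set(random.sample(range(n), k))    # X = {x1, ..., xk}, a subset of {0, ..., n-1}
    sample = {}                                  # offset -> most recent element at that offset
    for i, elem in enumerate(stream):
        if i % n in offsets:
            sample[i % n] = elem                 # overwrite: the previous occupant (index i - n) just expired
        yield list(sample.values())              # current (highly periodic) sample
```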
Another Simple Approach: Oversample
• As each element arrives, remember it with probability p = (ck log n)/n; otherwise discard it (sketch below)
• Discard elements when they expire
• When asked to produce a sample, choose k elements at random from the set in memory
• Expected memory usage is O(k log n)
• Uses O(k log n) memory whp
• The algorithm fails if fewer than k elements from a window are remembered; however, whp this does not happen
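A rough sketch of the oversampling approach, assuming p = (ck log n)/n with an illustrative constant c; all names here are made up for illustration:

```python
import math
import random
from collections import deque

def oversample(stream, n, k, c=2):
    """Oversampling: keep each arriving element with probability p = (c*k*log n)/n,
    drop expired elements, and answer a query with k random retained elements."""
    p = min(1.0, c * k * math.log(n) / n)
    kept = deque()                                # (index, element) pairs, oldest first
    for i, elem in enumerate(stream):
        if random.random() < p:
            kept.append((i, elem))
        while kept and kept[0][0] <= i - n:       # element left the window of the last n items
            kept.popleft()
        # On demand: random.sample([e for _, e in kept], k) -- this raises ValueError
        # if fewer than k elements survived, which happens only with low probability.
        yield [e for _, e in kept]
```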
Reservoir Sampling
• Classic online algorithm due to Vitter (1985) — sketch below
• Maintains a fixed-size uniform random sample
• Size of the data stream need not be known in advance
• Data structure: “reservoir” of k data elements
• As the ith data element arrives:
  • Add it to the reservoir with probability p = k/i, discarding a randomly chosen data element from the reservoir to make room
  • Otherwise (with probability 1 − p), discard it
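For reference, a minimal sketch of reservoir sampling as described above (note the k/i acceptance test also admits the first k elements, since k/i ≥ 1 there):

```python
import random

def reservoir_sample(stream, k):
    """Vitter-style reservoir sampling: maintain a uniform random sample of
    size k over everything seen so far (no window)."""
    reservoir = []
    for i, elem in enumerate(stream, start=1):     # i is the 1-based element count
        if len(reservoir) < k:
            reservoir.append(elem)                 # the first k elements always enter
        elif random.random() < k / i:              # include with probability k/i
            reservoir[random.randrange(k)] = elem  # evict a random current member
    return reservoir
```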
Why It Doesn’t Work With Moving Windows
• Suppose an element in the reservoir expires
• It must be replaced with a randomly chosen element from the current window
• However, in the data stream model we have no access to past data
• We could store the entire window, but this would require O(n) memory
Chain-Sample
• Include each new element in the sample with probability 1/min(i, n), where i is its index
• When an element is added to the sample, choose the index of the element that will replace it when it expires
  • When the ith element expires, the window will be (i+1, …, i+n), so choose the index uniformly at random from this range
• Once the element with that index arrives, store it and choose the index that will replace it in turn, building a “chain” of potential replacements
• When an element is chosen to be discarded from the sample, discard its “chain” as well (sketch below)
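A minimal sketch of chain-sample for a single sample slot (k = 1); for k > 1 the algorithm runs k independent copies. The list-based chain storage and all names are assumptions for illustration:

```python
import random

def chain_sample(stream, n):
    """Chain-sample (k = 1) over a window of the last n elements."""
    chain = []            # [(index, element), ...]; chain[0] is the current sample
    next_index = None     # index whose arrival should extend the chain
    for i, elem in enumerate(stream):
        # The new element replaces the whole chain with probability 1/min(i+1, n).
        if random.random() < 1.0 / min(i + 1, n):
            chain = [(i, elem)]
            next_index = i + random.randint(1, n)    # replacement for when element i expires
        elif next_index is not None and i == next_index:
            chain.append((i, elem))                   # extend the chain of potential replacements
            next_index = i + random.randint(1, n)
        # Expire the head of the chain and promote its successor.
        while chain and chain[0][0] <= i - n:
            chain.pop(0)
        yield chain[0][1] if chain else None          # current sample (None before the first pick)
```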
Example
(Figure: chain-sample run on the stream 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3, showing the sampled element and its chain of replacements as the window slides.)
Memory Usage of Chain-Sample
• Let T(x) denote the expected length of the chain from the element with index i when the most recent index is i + x
• T(x) = 0 for x ≤ 0, and T(x) = 1 + (1/n) Σ_{j<x} T(j) for x ≥ 1
• The expected length of each chain is less than T(n) ≤ e ≈ 2.718 (derivation below)
• Expected memory usage is O(k)
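Assuming the recurrence above is reconstructed correctly, a short derivation of the T(n) < e bound:

```latex
% Closed form of the chain-length recurrence T(x) = 1 + (1/n) \sum_{j<x} T(j).
\begin{align*}
T(x+1) - T(x) &= \tfrac{1}{n}\Bigl(\textstyle\sum_{j < x+1} T(j) - \sum_{j < x} T(j)\Bigr)
               = \tfrac{1}{n}\,T(x), \\
T(x) &= \bigl(1 + \tfrac{1}{n}\bigr)^{x-1} \quad \text{for } x \ge 1 \ (\text{since } T(1) = 1), \\
T(n) &= \bigl(1 + \tfrac{1}{n}\bigr)^{n-1} < e \approx 2.718.
\end{align*}
```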
Memory Usage of Chain-Sample
• A chain consists of “hops” with lengths in 1…n
• A chain of length j can be represented by a partition of n into j ordered integer parts
  • j − 1 hops with sum less than n, plus a remainder
• Each such partition has probability n^(−j)
• The number of such partitions is C(n, j) < (ne/j)^j
• The probability of any such partition is small, O(n^(−c)), when j = O(k log n) (sketch below)
• Uses O(k log n) memory whp
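Putting the slide's counting argument together (a sketch; the exact constants and off-by-one details are glossed over):

```latex
% Probability that a chain reaches length j, by the partition-counting argument above.
\[
\Pr[\text{a chain has length } j]
  \;\le\; \binom{n}{j}\, n^{-j}
  \;<\; \Bigl(\frac{ne}{j}\Bigr)^{\!j} n^{-j}
  \;=\; \Bigl(\frac{e}{j}\Bigr)^{\!j},
\]
which is $O(n^{-c})$ once $j = \Theta(\log n)$; over the $k$ chains this gives
total memory $O(k \log n)$ with high probability.
```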
Comparison of Algorithms
• Chain-sample is preferable to oversampling:
  • Better expected memory usage: O(k) vs. O(k log n)
  • Same high-probability memory bound of O(k log n)
  • No chance of failure due to the sample size shrinking below k
Timestamp-Based Windows
• The window at time t consists of all elements whose arrival timestamp is at least t′ = t − m
• The number of elements in the window is not known in advance and may vary over time
• None of the previous algorithms will work
  • All require windows with a constant, known number of elements
Priority-Sample
• We describe priority-sample for k = 1
• Assign each element a randomly chosen “priority”
• The element with the highest priority is the sample
• An element is ineligible if there is another element with a later timestamp and a higher priority
• Only eligible, non-expired elements are stored (sketch below)
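A minimal sketch of priority-sample for k = 1, assuming priorities drawn uniformly at random and a stream of (timestamp, element) pairs in arrival order; the deque-based bookkeeping is an implementation choice for the sketch, not necessarily the authors':

```python
import random
from collections import deque

def priority_sample(stream, m):
    """Priority-sample (k = 1) over a timestamp-based window of the last m time units.
    `stream` yields (timestamp, element) pairs in timestamp order."""
    eligible = deque()   # (timestamp, priority, element); priorities decrease front to back
    for t, elem in stream:
        prio = random.random()                        # random priority in [0, 1)
        # Stored elements become ineligible: the new element is later and has higher priority.
        while eligible and eligible[-1][1] <= prio:
            eligible.pop()
        eligible.append((t, prio, elem))
        while eligible and eligible[0][0] < t - m:    # drop expired elements
            eligible.popleft()
        yield eligible[0][2]                          # highest-priority live element = the sample
```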
Memory Usage of Priority-Sample
• Imagine the elements stored in a “treap”: totally ordered by arrival timestamp and heap-ordered by priority
• The eligible elements would form the right spine of the treap
• We store only the eligible elements
• Therefore the expected memory usage is O(log n), or O(k log n) for samples of size k (derivation below)
• O(k log n) is also a high-probability upper bound
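One standard way to see the expected O(log n) bound, assuming priorities are i.i.d. uniform (so ties have probability zero):

```latex
% The i-th most recent element is eligible iff its priority is the maximum
% among the i most recent elements, which happens with probability 1/i.
\[
\mathbb{E}\bigl[\#\,\text{eligible elements}\bigr]
  = \sum_{i=1}^{n} \Pr[\text{$i$-th most recent element is eligible}]
  = \sum_{i=1}^{n} \frac{1}{i}
  = H_n = O(\log n).
\]
```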
Conclusion
• Our contributions:
  • Introduced the problem of maintaining a sample over a moving window from a data stream
  • Developed the Chain-Sample algorithm for sequence-based windows
  • Developed the Priority-Sample algorithm for timestamp-based windows
• Future work:
  • What else can be computed in sublinear space over moving windows on data streams?
  • For example: the next talk!