Maintaining Time-Decaying Stream Aggregates • Edith Cohen, Martin Strauss • AT&T Labs-Research • PODS 2003
The Problem • A data stream is a sequence of data items observed over time. • There are multiple massive data streams. • Storage constraints allow us to maintain only a compact summary of the “essence” of the information in each stream. • The relevance of information decays with time. • Thus, when aggregating across time, older information should be discounted.
Applications • IP routing (RED protocol): a time-decayed average of previous queue lengths is used to estimate impending congestion at a router. • Internet gateway selection: track the quality (e.g. packet loss rate) of alternative paths to select the more reliable one. • Usage statistics of phone customers: AT&T has about 100M customers. • More…
Decay Functions • A decay function is a non-increasing function g(x)>=0 defined for x>=1. • f(t)>=0 is the value of the data item observed at time t. • The weight at time T of an item observed at time t is g(T-t). • The decayed value of the item is f(t)g(T-t).
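Time-Decaying Sum • The time-decaying sum at time T is the sum of the decayed values f(t)g(T-t) over all items observed at times t<T. • When the f(t) are 0/1 we refer to the problem as time-decaying count. • Maintaining the decaying sum exactly can generally require a linear number of bits. • We consider maintaining it approximately, to within a factor of (1+ε).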
Time-Decaying Average • Time-decaying weighted average of observed values. • f(t) is the value of the item observed at time t. • Maintaining a time-decaying average reduces to maintaining two time-decaying sums.
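The reduction to two sums can be sketched with a brute-force baseline that stores every item. This is an illustration only: it assumes the average is the decayed sum of values divided by the decayed sum of unit weights (one natural reading of the slide), and the names `decayed_sum`, `decayed_average` and the sample `items` are hypothetical.

```python
def decayed_sum(items, T, g):
    """Exact decayed sum at time T: f(t) * g(T - t) summed over items with t < T.

    items is a list of (t, f_t) pairs. Storing all of them is the
    linear-space baseline that the approximate methods avoid.
    """
    return sum(f_t * g(T - t) for t, f_t in items if t < T)


def decayed_average(items, T, g):
    """Decayed average as the ratio of two decayed sums (values over weights)."""
    weight = decayed_sum([(t, 1.0) for t, _ in items], T, g)
    if weight == 0:
        return None  # no item carries any weight at time T
    return decayed_sum(items, T, g) / weight


g = lambda x: 1.0 / x  # assumed decay function, for illustration only
items = [(1, 10.0), (2, 20.0), (3, 60.0)]  # made-up (time, value) pairs
print(decayed_average(items, 4, g))
```

The most recent item has weight 1 and the oldest has weight 1/3, so the result is pulled toward the recent value of 60.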
Interesting Families of Decay Functions • Exponential decay [Jacobson 88] • Sliding windows [DGIM02]: g(x)=1 for x<W, g(x)=0 otherwise • Polynomial decay • General decay functions…
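A minimal sketch of these families as Python functions. Only the sliding-window form is given on the slide; the forms exp(-lam*x) and 1/x**alpha, and the parameter names `lam`, `window` and `alpha`, are assumptions chosen for illustration.

```python
import math


def exponential_decay(lam):
    """g(x) = exp(-lam * x); lam > 0 is an assumed rate parameter."""
    return lambda x: math.exp(-lam * x)


def sliding_window_decay(window):
    """g(x) = 1 for x < window, 0 otherwise (the form given on the slide)."""
    return lambda x: 1.0 if x < window else 0.0


def polynomial_decay(alpha):
    """g(x) = 1 / x**alpha; alpha > 0 is an assumed exponent."""
    return lambda x: 1.0 / x ** alpha


def decayed_value(f_t, t, T, g):
    """Decayed value at time T of an item with value f_t observed at time t < T."""
    return f_t * g(T - t)
```

For example, `decayed_value(2.0, 3, 7, polynomial_decay(1))` weights the item by g(4) = 1/4.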
Exponential Decay • Used in networking applications (RED) • Very simple maintenance: […] • Lemma: exact tracking requires […] storage bits; approximate tracking uses […] bits.
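One standard way to keep an exponentially decayed sum in a single number is the recurrence below. It is a sketch under the assumption g(x) = exp(-lam*x), not necessarily the rule shown on the original slide; the class name `ExponentialDecaySum` and its `step` method are hypothetical.

```python
import math


class ExponentialDecaySum:
    """Decayed sum under the assumed form g(x) = exp(-lam * x), kept in one number."""

    def __init__(self, lam):
        self.factor = math.exp(-lam)  # weight lost per unit of elapsed time
        self.total = 0.0              # decayed sum as seen at the current time

    def step(self, value=0.0):
        """Advance one time unit, having observed `value` at the current time.

        Every older item loses one more factor of exp(-lam), and the new
        item enters with weight g(1) = exp(-lam).
        """
        self.total = self.factor * (self.total + value)
        return self.total
```

Because g(x+1)/g(x) is the same constant for every x, no per-item history is needed; this is what sets exponential decay apart from the other families.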
Sliding Window Decay • “Sharp threshold” • Lemma [DGIM02]: sliding window decay can be approximately tracked using O(log^2 W) bits (for 0/1 or polynomial-size values). • Upper bound using the Exponential Histogram (EH) technique.
Polynomial Decay • Lemma: lower bound: […]; upper bound: O(log N log log N) bits (N is elapsed time). • Often more appropriate to applications than exponential or sliding window decay. • More efficient than sliding window decay (nearly quadratic gap), almost as efficient as exponential decay.
General Decay Functions • Lemma: can be (approximately) maintained using O(log^2 N) bits (N is the minimum of the elapsed time and the min x for which g(x)=0). • Algorithm based on an adaptation of the Exponential Histograms technique. • Sliding windows [DGIM02] (with […]) are as “hard” to maintain as general decay.
Why Polynomial Decay? • Link performance over time [Figure: performance (good/bad) of Link A and Link B plotted against time, up to time t0] • Which link should we select past time t0? Initially A or B, eventually B.
Link Selection Example (cont.) • Polynomial decay (by tuning the parameter): initially A or B, eventually B. • Exponential decay: constant relative value of A and B; either A forever or B forever. • Sliding window decay: first B, then A, then the same… • Polynomial decay can model our expectation (as can other smooth subexponential functions…).
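The contrast between polynomial and exponential decay can be reproduced numerically. The failure logs below are made up and are not the data behind the slide's figure; the decay forms 1/x and exp(-0.05*x), and the rule "prefer the link with the smaller decayed failure count", are assumptions for illustration.

```python
import math


def decayed_count(times, T, g):
    """Decayed count at time T of events observed at the given times (all < T)."""
    return sum(g(T - t) for t in times)


# Hypothetical logs: A had a long outage long ago, B a short recent one.
failures_a = range(1, 51)    # 50 bad time steps, t = 1..50
failures_b = range(91, 101)  # 10 bad time steps, t = 91..100

poly = lambda x: 1.0 / x              # assumed polynomial decay
expo = lambda x: math.exp(-0.05 * x)  # assumed exponential decay


def preferred(T, g):
    """Pick the link with the smaller decayed failure count at time T."""
    a = decayed_count(failures_a, T, g)
    b = decayed_count(failures_b, T, g)
    return "A" if a < b else "B"


for T in (101, 1000, 10000):
    print(T, preferred(T, poly), preferred(T, expo))
```

With these logs, polynomial decay prefers A at T=101 and B from T=1000 on, while exponential decay prefers A at every T. Once no new items arrive, the ratio of the two exponentially decayed counts never changes, so the choice made at t0 is kept forever.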
Summary of Bounds • Approximate to within a factor of (1+ε) • N is the minimum of the elapsed time and the min x for which g(x)=0 • [Table of bounds not preserved]
Bucketing the Stream • [Figure: the 0/1 stream 1 0 0 1 1 0 1 0 0 1 laid out along the time axis and split into buckets of time width 4 (count 2), time width 3 (count 2), and time width 3 (count 1); merging the first two gives a bucket of time width 7 (count 4).] • Histogram determined by time boundaries and bucket counts • Time boundaries can be fixed (counts maintained per stream) • Counts can be fixed (time boundaries maintained per stream)
Exponential Histograms [DGIM02] • Introduced for sliding windows. • Each new item is placed in a new bucket. • Two buckets are merged when their combined count is at most a fraction of the combined count of all earlier buckets. • Buckets whose start time is more than W in the past are discarded. • Bucket counts are independent of the stream. • The sum of bucket counts is a constant-factor approximation for […]
Exponential Histograms (cont.) • Example for a factor-2 approximation (bucket counts after each arrival): 1; 1, 1; 1, 1, 1; 1, 1, 2 (merge); 1, 1, 1, 2; 1, 1, 2, 2 (merge) • Values whose time is “in question” (before or after W) are aggregated in the least recent bucket.
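The bucket sequence above can be reproduced with a DGIM-style histogram that allows at most three buckets of each size. This is a sketch, not the slides' exact merge rule: the cap `max_per_size=3` is an assumption picked to match the example, each bucket stores the timestamp of its newest 1 rather than its start time, and the window is taken to cover the last W timestamps.

```python
class ExponentialHistogram:
    """DGIM-style histogram for counting 1s in a sliding window of width W."""

    def __init__(self, window, max_per_size=3):
        self.window = window
        self.max_per_size = max_per_size  # assumed cap on buckets of equal count
        self.buckets = []  # [timestamp of newest 1, count], newest bucket first

    def add(self, t, bit):
        # Drop buckets whose newest 1 has left the window.
        while self.buckets and self.buckets[-1][0] <= t - self.window:
            self.buckets.pop()
        if not bit:
            return
        self.buckets.insert(0, [t, 1])
        size = 1
        while True:
            same = [i for i, b in enumerate(self.buckets) if b[1] == size]
            if len(same) <= self.max_per_size:
                break
            older, newer = same[-1], same[-2]  # two oldest buckets of this size
            self.buckets[newer][1] += self.buckets[older][1]
            del self.buckets[older]
            size *= 2  # the merge may overflow the next size class

    def counts(self):
        """Bucket counts, newest bucket first (the order used in the example)."""
        return [count for _, count in self.buckets]

    def estimate(self):
        """Total count minus half of the oldest bucket, which may straddle W."""
        if not self.buckets:
            return 0
        return sum(self.counts()) - self.buckets[-1][1] // 2


eh = ExponentialHistogram(window=100)
for t in range(1, 7):
    eh.add(t, 1)
    print(eh.counts())
```

Only the oldest bucket can straddle the window boundary, which is why the estimate discounts half of it.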
EH Properties • The number of buckets is O(log W); for each bucket we need to record the exact start time, so we need O(log W) storage per bucket (total is O(log^2 W)). • An EH for sliding window W can be used to approximate sliding window j for all j<W. • Lemma: an EH can be used to approximate general decay functions (with W = minimum of elapsed time and min x for which g(x)=0).
Reducing any Decay Function to Sliding Windows • Decay function g(x) • From the (approximate) sliding-window sums for all W<=N we can compute the (approximate) decayed sum according to g(). • With an EH with W=N we can compute (approximately) decayed sums according to all decay functions g() up to elapsed time N (or forever if g(N)=0).
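The reduction can be checked with summation by parts: the decayed sum is a combination of sliding-window sums whose coefficients g(W)-g(W+1) are non-negative because g is non-increasing. The sketch below uses exact window sums and the convention that window W covers elapsed times 1..W; both choices, and all function names, are assumptions for illustration.

```python
def window_sum(values, W):
    """Sum of the W most recent values; values[-1] has elapsed time 1."""
    return sum(values[-W:]) if W > 0 else 0.0


def decayed_sum_direct(values, g):
    """Reference answer: each value weighted by g(elapsed time)."""
    return sum(values[-x] * g(x) for x in range(1, len(values) + 1))


def decayed_sum_from_windows(values, g):
    """Decayed sum written as a non-negative combination of window sums."""
    N = len(values)
    if N == 0:
        return 0.0
    total = g(N) * window_sum(values, N)
    for W in range(1, N):
        total += (g(W) - g(W + 1)) * window_sum(values, W)
    return total


values = [2.0, 0.0, 1.0, 4.0, 3.0]  # made-up stream, oldest first
g = lambda x: 1.0 / x               # assumed decay function
print(decayed_sum_direct(values, g), decayed_sum_from_windows(values, g))
```

Since every coefficient is non-negative, replacing each exact window sum with a (1+ε)-approximation yields a (1+ε)-approximation of the decayed sum.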
Weight-Based Merging • Bucket start times depend only on elapsed time. • WBM histograms apply to decay functions where g(x)/g(x+1) is non-increasing. • The number of buckets is O(log(g(1)/g(N))). • O(log log N) storage per bucket (for approximate bucket counts). • More efficient than EH on decay that is slightly super-polynomial or slower. • O(log N log log N) storage for polynomial decay.
WBM Histograms – How? • Region boundaries b1, b2, b3, …: […] • The current most-recent bucket is sealed and a new bucket is started at each T such that T mod b1=0. • Two consecutive buckets that are in the same region (according to elapsed start and end times) are merged. • At most 2 buckets per region.
WBMH Example • g(x)=1/x, (1+ε)=2 • Regions (by weight): 1, 1/2; 1/3, 1/4, 1/5, 1/6; 1/7, 1/8, …, 1/14 • [Figure: bucket layout at times T=1 through T=6]
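The regions in this example can be recomputed with a short sketch. The rule used here (each region is the longest run of elapsed times over which the weight drops by at most a factor of 1+ε) is inferred from the listed regions, not taken from a formula on the slides; the function name `weight_regions` is hypothetical.

```python
from fractions import Fraction


def weight_regions(g, ratio, max_elapsed):
    """Split elapsed times 1..max_elapsed into maximal runs over which g
    drops by at most a factor of `ratio` (playing the role of 1 + epsilon)."""
    regions = []
    start = 1
    while start <= max_elapsed:
        end = start
        # Extend the region while the next weight is still within the ratio.
        while end + 1 <= max_elapsed and g(start) <= ratio * g(end + 1):
            end += 1
        regions.append((start, end))
        start = end + 1
    return regions


g = lambda x: Fraction(1, x)  # exact arithmetic avoids rounding at the boundaries
print(weight_regions(g, 2, 14))  # prints [(1, 2), (3, 6), (7, 14)]
```

Region lengths double each time for g(x)=1/x, so N time units fall into about log N regions, in line with the O(log(g(1)/g(N))) bucket count quoted for weight-based merging.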
Conclusion • Summary: efficient computation of time-decayed sums/averages for general decay functions; very efficient computation for polynomial decay. • Open question: O(log N) storage for polynomial decay? • Subsequent related work: spatial decay (sensor nets/p2p nets).