Data-Streams and Histograms

Data-Streams and Histograms Sudipto Guha, Nick Koudas & Kyuseok Shim

Background • Histogram • Captures distribution statistics in an efficient manner • Applications • Query optimization • Approximate query answering • Data mining (time series in particular) • Piecewise transmission of data • EquiWidth, EquiDepth, MHIST, MaxDiff, V-OPT

Background • Data Stream • An ordered sequence of points that can be read only once or a small number of times • Applications • Mission critical network components • Dynamic traffic configuration, fault identification, troubleshooting • Performance of algorithm measured by number of passes algorithm must make over the stream

Motivation • Since the end use of a histogram is to approximate a data distribution, why not use a near-optimal approximation of the best histogram if it means linear time computation?

Motivation • Approximate V-OPT histograms by improving the dynamic programming solution from quadratic to linear time • Revised algorithm uses little space, hence suitable for data stream model • Assumes cost of interval is monotonic under inclusion

Problem Statement • Given: • non-negative integers v1, ..., vn • k intervals or buckets to partition the index 1..n • Constraint: • Minimize k VARk where is the variance of values in the kth bucket • Dynamic Programming solution: • OPT[k, n] = min {OPT[k-1, x] + VAR[(x+1)..n]} • Runs in O(n2k) time with O(n) space x<n

Intuition of Improvement • For a x  b, • VAR[a..n]  VAR[x..n]  VAR [b..n] (1) • OPT[a..n]  OPT[x..n]  OPT[b..n] (2) • Use this monotonicity property to reduce the search space by settling for an approximation • Instead of storing the whole OPT function, approximate it by a histogram!

Intuition of Improvement • For all 1  p  k, maintain intervals (a1,b1),…, (al, bl) • Value of bi (1+)ai • The number of intervals l depends on p • The value for each interval substitutes for each value in the interval reducing space and time complexity

Results • Theorem: A (1+) approximation for V-OPT runs in O((k2/)log n) space and time O((nk2/)log n) in the data stream model

Advantages and Disadvantages • Accuracy/runtime tradeoff can be controlled by the parameter  • For data-stream model, alternatives abound: • Random sampling (simple, assumption of distribution) • Other histogram techniques (faster, less optimal) • Wavelet (flexibility) • Sliding Windows (later paper)

Improvements

Conclusion • The authors provided an algorithm for approximating a distribution that runs reasonably fast and with small space requirements • Proposed solution can be applied to data-stream model because values are not referred to unless they are stored

Data-Streams and Histograms