
Data-Streams and Histograms

Sudipto Guha, Nick Koudas & Kyuseok Shim. Background: a histogram captures distribution statistics in an efficient manner. Applications: query optimization, approximate query answering, data mining (time series in particular), piecewise transmission of data.



  1. Data-Streams and Histograms Sudipto Guha, Nick Koudas & Kyuseok Shim

  2. Background • Histogram • Captures distribution statistics in an efficient manner • Applications • Query optimization • Approximate query answering • Data mining (time series in particular) • Piecewise transmission of data • EquiWidth, EquiDepth, MHIST, MaxDiff, V-OPT
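To make the first two histogram types on this slide concrete, here is a minimal sketch of how equi-width and equi-depth bucket boundaries differ on skewed data. The function names `equi_width_edges` and `equi_depth_edges` are hypothetical, and the quantile rule used for equi-depth is one simple choice among several.

```python
def equi_width_edges(values, k):
    """k buckets of equal width over the value range; returns k + 1 edges."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(k + 1)]


def equi_depth_edges(values, k):
    """k buckets holding roughly equal numbers of values; returns k + 1 edges.

    Assumption: interior edges are taken at positions i * n // k of the
    sorted data, a simple quantile rule chosen for illustration.
    """
    ordered = sorted(values)
    n = len(ordered)
    interior = [ordered[(i * n) // k] for i in range(1, k)]
    return [ordered[0]] + interior + [ordered[-1]]


skewed = [1, 2, 3, 4, 100]
print(equi_width_edges(skewed, 2))  # one edge lands in the empty middle of the range
print(equi_depth_edges(skewed, 2))  # the interior edge follows the data
```

On skewed data the equi-width split leaves almost every value in one bucket, while the equi-depth split balances the counts. V-OPT goes further by choosing boundaries that minimize the error within buckets.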

  3. Background • Data Stream • An ordered sequence of points that can be read only once or a small number of times • Applications • Mission critical network components • Dynamic traffic configuration, fault identification, troubleshooting • Performance of algorithm measured by number of passes algorithm must make over the stream

  4. Motivation • Since the end use of a histogram is to approximate a data distribution, why not use a near-optimal approximation of the best histogram if it means linear time computation?

  5. Motivation • Approximate V-OPT histograms by improving the dynamic programming solution from quadratic to linear time • Revised algorithm uses little space, hence suitable for data stream model • Assumes cost of interval is monotonic under inclusion

  6. Problem Statement • Given: • non-negative integers v_1, ..., v_n • k intervals (buckets) that partition the index range 1..n • Objective: • Minimize Σ_{i=1..k} VAR_i, where VAR_i is the variance of the values in the ith bucket • Dynamic programming solution: • OPT[k, n] = min_{x<n} {OPT[k-1, x] + VAR[(x+1)..n]} • Runs in O(n²k) time with O(n) space
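The recurrence on this slide can be sketched directly. The code below is an illustrative version, not the authors' implementation. It assumes the bucket cost VAR is the sum of squared deviations from the bucket mean, and it uses prefix sums so each bucket cost takes O(1) time. The names `make_sse` and `vopt_exact` are hypothetical.

```python
from itertools import accumulate


def make_sse(values):
    """Return sse(i, j): squared error of values[i:j] around its own mean."""
    prefix = [0] + list(accumulate(values))
    prefix_sq = [0] + list(accumulate(v * v for v in values))

    def sse(i, j):
        if j <= i:
            return 0.0
        s = prefix[j] - prefix[i]
        sq = prefix_sq[j] - prefix_sq[i]
        # Clamp tiny negative results caused by floating-point rounding.
        return max(0.0, sq - s * s / (j - i))

    return sse


def vopt_exact(values, k):
    """Minimum total bucket error using at most k buckets, in O(n^2 k) time."""
    n = len(values)
    sse = make_sse(values)
    # prev[j] holds OPT[p - 1, j]; only two rows are kept, hence O(n) space.
    prev = [sse(0, j) for j in range(n + 1)]
    for _ in range(2, k + 1):
        cur = [0.0] * (n + 1)
        for j in range(1, n + 1):
            # Try every position x < j for the start of the last bucket.
            cur[j] = min(prev[x] + sse(x, j) for x in range(j))
        prev = cur
    return prev[n]


print(vopt_exact([1, 2, 3, 10, 11, 12], 2))
```

The inner `min` scans every x < j, which is where the quadratic dependence on n comes from.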

  7. Intuition of Improvement • For a ≤ x ≤ b, • VAR[a..n] ≥ VAR[x..n] ≥ VAR[b..n] (1) • OPT[a..n] ≥ OPT[x..n] ≥ OPT[b..n] (2) • Use this monotonicity property to reduce the search space by settling for an approximation • Instead of storing the whole OPT function, approximate it by a histogram!
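The monotonicity property can be checked numerically. The sketch below assumes the bucket cost is the sum of squared deviations (SSE), which does shrink as an interval shrinks. It also shows that variance normalized by bucket length does not have this property, which is why the SSE reading is assumed here. All names are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def variance(values):
    """SSE divided by the number of values."""
    return sse(values) / len(values) if values else 0.0


def suffix_costs(values, cost):
    """cost(values[a:]) for every start a, i.e. the cost of a..n as a grows."""
    return [cost(values[a:]) for a in range(len(values))]


def is_non_increasing(costs, tol=1e-9):
    return all(a >= b - tol for a, b in zip(costs, costs[1:]))


data = [5, 5, 5, 5, 0, 10]
sse_ok = is_non_increasing(suffix_costs(data, sse))       # holds for SSE
var_ok = is_non_increasing(suffix_costs(data, variance))  # fails for variance
print(sse_ok, var_ok)
```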

  8. Intuition of Improvement • For all 1 ≤ p ≤ k, maintain intervals (a_1, b_1), …, (a_l, b_l) • Value at b_i ≤ (1+δ) × value at a_i, where δ is a small factor derived from ε • The number of intervals l depends on p • The value stored for each interval substitutes for every value in the interval, reducing space and time complexity
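The interval idea can be sketched as a one-pass procedure. This is an illustrative reading of the slide, not the authors' code. It assumes the bucket cost is the sum of squared deviations, it assumes a per-level slack of delta = eps / (2k), and it stores running sums at each interval's right endpoint so tail costs can be computed without revisiting the stream. The name `vopt_stream` is hypothetical.

```python
def vopt_stream(stream, k, eps):
    """One-pass approximate V-OPT error using at most k buckets (a sketch)."""
    delta = eps / (2 * k)  # assumed per-level slack
    count, total, total_sq = 0, 0, 0
    # levels[p] lists intervals over the p-bucket error, for p = 1..k-1.
    # Each entry: [error at start, end index, sum, sum of squares, error at end]
    levels = [[] for _ in range(k)]
    answer = 0.0
    for v in stream:
        count += 1
        total += v
        total_sq += v * v
        errs = [0.0] * (k + 1)
        errs[1] = max(0.0, total_sq - total * total / count)
        for p in range(2, k + 1):
            best = errs[p - 1]  # using fewer buckets is always allowed
            # Only the stored right endpoints are tried as bucket boundaries.
            for _, c, s, sq, err in levels[p - 1]:
                tail = (total_sq - sq) - (total - s) ** 2 / (count - c)
                best = min(best, err + max(0.0, tail))
            errs[p] = best
        for p in range(1, k):
            intervals = levels[p]
            if intervals and errs[p] <= (1 + delta) * intervals[-1][0]:
                # Error stayed within the slack: extend the last interval.
                intervals[-1][1:] = [count, total, total_sq, errs[p]]
            else:
                intervals.append([errs[p], count, total, total_sq, errs[p]])
        answer = errs[k]
    return answer


print(vopt_stream([1, 2, 3, 10, 11, 12], k=2, eps=0.1))
```

Each arriving value is compared only against the stored interval endpoints, so the work per element depends on the number of intervals rather than on the number of values seen so far.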

  9. Results • Theorem: A (1+ε) approximation for V-OPT runs in O((k²/ε) log n) space and O((nk²/ε) log n) time in the data stream model
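To get a feel for the gap between the quadratic dynamic program and the bound in the theorem, the two growth terms can be evaluated for sample parameters. This is a rough sketch only: constant factors hidden by the O-notation are ignored, the logarithm base is assumed to be 2, and the function names are hypothetical.

```python
import math


def exact_dp_term(n, k):
    """Growth term of the exact dynamic program: n^2 * k."""
    return n * n * k


def approx_time_term(n, k, eps):
    """Growth term of the approximation: (n * k^2 / eps) * log2(n)."""
    return n * k * k / eps * math.log2(n)


def approx_space_term(n, k, eps):
    """Growth term of the approximation's space: (k^2 / eps) * log2(n)."""
    return k * k / eps * math.log2(n)


n, k, eps = 1_000_000, 10, 0.1
print(exact_dp_term(n, k) / approx_time_term(n, k, eps))  # ratio of the two terms
print(approx_space_term(n, k, eps))                       # far smaller than n
```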

  10. Advantages and Disadvantages • Accuracy/runtime tradeoff can be controlled by the parameter ε • For data-stream model, alternatives abound: • Random sampling (simple, assumption of distribution) • Other histogram techniques (faster, less optimal) • Wavelet (flexibility) • Sliding Windows (later paper)

  11. Improvements

  12. Conclusion • The authors provided an algorithm for approximating a distribution that runs reasonably fast and with small space requirements • Proposed solution can be applied to data-stream model because values are not referred to unless they are stored
