Stream Data Introduction

msaldana



  1. Stream Data Introduction

  2. Outline • Streaming Data description • Uses/Applications • Problems/Challenges • Main Concepts • variance & k-means • aging & sliding windows • algorithms • References

  3. Sizing the challenge • WalMart records 20 million transactions • Google handles 100 million searches • AT&T produces 275 million call records • An Earth-sensing satellite produces GBs of data • All of this in a single day!

  4. Characteristics/Description • Stream data sets are… • Continuous • Massive • Unbounded • Possibly infinite • Fast-changing, requiring fast, real-time response

  5. Example: Network Management Application • Network management involves monitoring and configuring network hardware and software to ensure smooth operation • Monitor link bandwidth usage, estimate traffic demands • Quickly detect faults and congestion, and isolate the root cause • Balance load and improve utilization of network resources • AT&T collects 100 GBs of NetFlow data each day!

  6. Network Management Application (cont.) • Figure: Network, Measurements, Alarms, Network Operations Center

  7. Uses/Applications • Banking/Stocks/Financials • credit card fraud detection • stock trends monitoring • Sensors • power grid balancing • engine controls • collision avoidance • driver sleep monitor

  8. Problems/Challenges • “Zillions” of data items • Continuous/Unbounded • Examples arrive faster than they can be mined • Application may require fast, real-time response • Examples: • life-threatening: collision avoidance • lost revenue/transactions: hung-up networks

  9. Problems/Challenges • Time/Space constrained • Not enough memory • Can’t afford storing/revisiting the data • Single-pass computation • External-memory algorithms (designed for data sets larger than main memory) cannot be used: • They do not support continuous queries • They are too slow for real-time response

  10. Problems/Challenges • In summary… • Can’t stop to smell the roses… • Only one chance/single pass/look at the data

  11. Problems/Challenges • Other Considerations • Classical algorithms (e.g. CART, C4.5) do not scale up to data streams [DH00] • Most need the entire data set for analysis • They assume random access to (or multiple passes over) the data • Difficult to compute answers accurately with limited memory • With probability at least 1 − δ, algorithms compute an approximate answer within a factor ε of the actual answer • Noise (bad sensors, outliers) • Aging/Old/Stale data

  12. Computation Model • Figure: Data Streams → Stream Processing Engine (with Synopsis in Memory) → (Approximate) Answer → Decision Making

  13. Model Components • Synopsis • Summary of the data • Samples, Histograms • Processing Engine • Implementation/Management System • STREAM (Stanford): general-purpose • Aurora (Brown/MIT): sensor monitoring, dataflow • Telegraph (Berkeley): adaptive engine for sensors • Decision Making • Apply Data Mining techniques • Decision Trees, Clusters, Association Rules

  14. Synopsis: Dealing with Time/Space Constraints • Since data can’t be contained, or revisited, the best alternative is to summarize what has been seen. • Basic stream synopsis computation • Random Sampling: Generate statistics using a representative sample of the data • Histograms: Distribution/Grouping data representation • Wavelets: Mathematical tool for hierarchical decomposition of functions/signals

  15. Reservoir Sampling • Sample first m items • Choose to sample the i’th item (i>m) with probability m/i • If sampled, randomly replace a previously sampled item • Optimization: when i gets large, compute which item will be sampled next, skip over intervening items
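The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function name `reservoir_sample` and the optional `rng` parameter are assumptions, and the "skip over intervening items" optimization is left out for clarity.

```python
import random

def reservoir_sample(stream, m, rng=None):
    """Return a uniform random sample of m items from an iterable of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= m:
            reservoir.append(item)               # keep the first m items
        elif rng.random() < m / i:               # sample the i-th item with probability m/i
            reservoir[rng.randrange(m)] = item   # replace a uniformly chosen earlier sample
    return reservoir

# Single pass over a stream whose length is not known in advance.
sample = reservoir_sample(range(100_000), m=5, rng=random.Random(42))
```

Only the m sampled items are ever held in memory, which is what makes the method fit the single-pass constraint described earlier.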

  16. Reservoir Sampling - Analysis • Analyze simple case: sample size m = 1 • Probability that the i’th item is the sample from a stream of length n: • Prob. i is sampled on arrival × prob. i survives to the end = $\frac{1}{i} \times \frac{i}{i+1} \times \frac{i+1}{i+2} \times \cdots \times \frac{n-2}{n-1} \times \frac{n-1}{n} = \frac{1}{n}$ • Case for m > 1 is similar, easy to show uniform probability • Drawback of reservoir sampling: hard to parallelize

  17. Min-wise Sampling • For each item, pick a random fraction between 0 and 1 • Store item(s) with the smallest random tag [Nath et al.’04] 0.391 0.908 0.291 0.555 0.619 0.273 • Each item has same chance of least tag, so uniform • Can run on multiple streams separately, then merge
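A minimal sketch of min-wise sampling for a sample of size one, including the merge step across separately processed streams. The names `minwise_sample` and `merge` are hypothetical and chosen only for this illustration.

```python
import random

def minwise_sample(stream, rng):
    """Return (tag, item) for the item that drew the smallest random tag."""
    best = None
    for item in stream:
        tag = rng.random()                    # random fraction in [0, 1)
        if best is None or tag < best[0]:
            best = (tag, item)                # remember the smallest tag seen so far
    return best

def merge(*candidates):
    """Combine per-stream winners: the globally smallest tag wins."""
    return min((c for c in candidates if c is not None), key=lambda c: c[0])

rng = random.Random(7)
left = minwise_sample(["a", "b", "c"], rng)    # first stream, processed on its own
right = minwise_sample(["d", "e", "f"], rng)   # second stream, processed on its own
tag, item = merge(left, right)
```

Because only the (tag, item) pair has to be exchanged, the per-stream work can run independently, which is the property that plain reservoir sampling lacks.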

  18. Histograms • Histograms approximate the frequency distribution of element values in a stream • A histogram (typically) consists of • A partitioning of element domain values into buckets • A count per bucket B (of the number of elements in B) • Long history of use for selectivity estimation within a query optimizer

  19. Histogram Sampling • Equi-Depth • Element counts per bucket are kept constant • V-Optimal • Minimize frequency variance within buckets • Exponential Histograms (EH) • Bucket sizes are non-decreasing powers of 2 • Size: total number of 1’s in the bucket • For every bucket size other than that of the last (largest) bucket, there are at least k/2 and at most k/2+1 buckets of that size • Example: k=4: (1,1,2,2,2,4,4,4,8,8,..) • Essential component of the “sliding windows” technique for addressing “aging” data

  20. Equi-Depth • V-Optimal • Exponential Histograms

  21. Exponential Histogram • Assume k/2 = 2 • 32,16,8,8,4,4,2,1,1

  22. Exponential Histogram • Assume k/2 = 2 • 32,16,8,8,4,4,2,1,1 • A new 1 arrives

  23. Exponential Histogram • Assume k/2 = 2 • 32,16,8,8,4,4,2,1,1,1 (merge!) → 32,16,8,8,4,4,2,2,1 (merged!)

  24. Exponential Histogram • Assume k/2 = 2 • 32,16,8,8,4,4,2,1,1 → 32,16,8,8,4,4,2,2,1 → 32,16,8,8,4,4,2,2,1,1 → 32,16,16,8,4,2,1 (when one more 1 arrives, the merges cascade through sizes 1, 2, 4 and 8)

  25. Exponential Histogram • Assume k/2 = 2 • 32,16,8,8,4,4,2,1,1 → 32,16,8,8,4,4,2,2,1 → 32,16,8,8,4,4,2,2,1,1 → 32,16,16,8,4,2,1 (when one more 1 arrives, the merges cascade through sizes 1, 2, 4 and 8)
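The merge sequence on the preceding slides can be replayed with a short sketch that tracks bucket sizes only (no timestamps). The function name `insert_one` and the parameter `max_same` are assumptions; `max_same=2` is chosen because it reproduces the sequence shown on the slides, where a third bucket of a given size triggers a merge.

```python
def insert_one(buckets, max_same=2):
    """Insert a new 1 into a list of bucket sizes ordered oldest (largest) first."""
    buckets = buckets + [1]                  # the new element forms its own size-1 bucket
    size = 1
    while buckets.count(size) > max_same:
        first = buckets.index(size)          # leftmost = oldest bucket of this size
        buckets[first:first + 2] = [2 * size]  # merge the two oldest into one of double size
        size *= 2                            # the merge may overflow the next size up
    return buckets

h = [32, 16, 8, 8, 4, 4, 2, 1, 1]
h1 = insert_one(h)    # three 1s: the two oldest merge into a 2
h2 = insert_one(h1)   # two 1s: no merge needed
h3 = insert_one(h2)   # merges cascade through sizes 1, 2, 4 and 8
```

Each insertion touches at most one merge per bucket size, so the work per arriving element is logarithmic in the window length.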

  26. Answering Queries using Histograms • (Implicitly) map the histogram back to an approximate relation, and apply the query to the approximate relation • Example: select count(*) from R where 4<=R.e<=15 • For equi-depth histograms, maximum error: • Figure: count spread evenly among bucket values over the domain 1 to 20, with the query range 4 ≤ R.e ≤ 15 marked; answer: 3.5 *
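A minimal sketch of answering a range-count query from a histogram under the "count spread evenly among bucket values" assumption. The function name `estimate_range_count` and the bucket boundaries and counts in the example are hypothetical; they are not the values from the slide's figure.

```python
def estimate_range_count(buckets, lo, hi):
    """Estimate count(*) where lo <= value <= hi.

    buckets: list of (low, high, count) tuples with inclusive integer bounds.
    Each bucket's count is assumed to be spread evenly over its values.
    """
    total = 0.0
    for b_lo, b_hi, count in buckets:
        overlap = min(hi, b_hi) - max(lo, b_lo) + 1   # integer values shared with the query
        if overlap > 0:
            total += count * overlap / (b_hi - b_lo + 1)
    return total

# Hypothetical equi-depth histogram over the domain 1..20 (10 elements per bucket).
hist = [(1, 5, 10), (6, 10, 10), (11, 15, 10), (16, 20, 10)]
estimate = estimate_range_count(hist, 4, 15)   # 2/5 of the first bucket plus two full buckets
```

Buckets fully inside the query range contribute exactly; only the partially covered buckets at the two ends introduce error.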

  27. Sliding Windows Technique • Background: • Some applications rely on ALL historical data • But for most applications, OLD data is considered less relevant and could skew results from NEW trends or conditions • new processes/procedures • new hardware/sensors • new fashion trends

  28. Sliding Windows (cont.) • Common approaches addressing Old data: • Aging Model • elements are associated with “weights” that decrease over time • may use some exponential decay formulas • Sliding Windows Model • Only last “N” elements are considered • Incorporate examples as they arrive • The record “expires” at time t+N (N is the window length) • Count only the “1’s” in bit-stream data
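The aging model can be illustrated with an exponentially decayed sum, in which every earlier element's weight is multiplied by a constant factor each time a new element arrives. This is one possible decay formula, shown as a sketch; the class name `DecayedCounter` and the `decay` parameter are assumptions.

```python
class DecayedCounter:
    """Running sum in which an element that arrived t steps ago has weight decay**t."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.value = 0.0

    def update(self, x):
        # Age everything seen so far by one step, then add the new element at full weight.
        self.value = self.decay * self.value + x
        return self.value

counter = DecayedCounter(decay=0.5)
history = [counter.update(bit) for bit in [1, 0, 1]]
```

Unlike a sliding window, old elements never expire outright here; their influence shrinks geometrically, and only a single number has to be stored.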

  29. Sliding Window (SW) Model Time Increases ….1 0 1 0 0 0 1 0 1 1 1 1 1 1 0 0 0 1 0 1 0 0 1 1… Window Size N = 7 Current Time

  30. Sliding Windows Plus Exponential Histograms • Sliding Windows Approach (informal pseudocode) • Consider only the last N elements. • Define k = 1/ε, and round k/2 to the nearest integer. • Timestamp each “1” that arrives in the stream and insert it as a new first bucket, shifting the earlier buckets. • The new first bucket has size 1, since it holds a single “1”. • If the number of buckets of the same size exceeds k/2 + 1, merge the two oldest buckets of that size, so that at least k/2 buckets of that size remain. • Merging creates a new bucket whose size is the sum of the two merged sizes. • Drop the last (oldest) bucket once the timestamp of its most recent “1” falls outside the last N elements.
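The procedure above can be sketched as a small class that keeps timestamped buckets and estimates the number of 1's in the last N bits. This is an illustration under stated assumptions, not the reference implementation from [DGIM02]: the class name `SlidingOnesCounter`, the parameter `max_same` (the largest number of buckets of one size allowed before a merge), and the estimate "total minus half of the oldest bucket" are choices made for this sketch.

```python
from collections import deque

class SlidingOnesCounter:
    """Approximate count of 1s among the last `window` bits of a bit stream."""

    def __init__(self, window, max_same=2):
        self.window = window
        self.max_same = max_same
        self.time = 0
        self.buckets = deque()   # (timestamp of newest 1, size), newest bucket first

    def add(self, bit):
        self.time += 1
        # Expire the oldest bucket once its newest 1 has left the window.
        while self.buckets and self.buckets[-1][0] <= self.time - self.window:
            self.buckets.pop()
        if bit == 1:
            self.buckets.appendleft((self.time, 1))
            self._merge()

    def _merge(self):
        items = list(self.buckets)
        size = 1
        while True:
            idx = [i for i, (_, s) in enumerate(items) if s == size]
            if len(idx) <= self.max_same:
                break
            a, b = idx[-2], idx[-1]                      # the two oldest buckets of this size
            items[a:b + 1] = [(items[a][0], 2 * size)]   # keep the newer timestamp
            size *= 2                                    # check the next size for overflow
        self.buckets = deque(items)

    def estimate(self):
        if not self.buckets:
            return 0
        total = sum(s for _, s in self.buckets)
        # Only part of the oldest bucket may still be inside the window: count half of it.
        return total - self.buckets[-1][1] // 2

counter = SlidingOnesCounter(window=7)
for bit in [1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1]:
    counter.add(bit)
approx = counter.estimate()
```

Only the oldest bucket is uncertain, so the error is bounded by half of that bucket's size; allowing more buckets per size (a larger `max_same`) shrinks the oldest bucket relative to the total and tightens the estimate.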

  31. Benefits of Sliding Windows • Incorporates new elements as they appear. • Easy to calculate statistics over data streams with respect to the last N elements based on the histogram. • Can estimate the number of 1’s within a factor of (1 + ε) using only $\Theta(\frac{1}{\epsilon} \log^2 N)$ bits of memory.

  32. Expansion of Sliding Windows • The original Sliding Window method was not fully applicable to two important statistics, because of how buckets are merged: • k-median and variance • A solution was devised by Babcock, Datar, Motwani and O’Callaghan [BDMO03] • Their work derived a methodology for variance that was also applied to k-medians.

  33. Variance and k-Medians • Variance: $\sum_i (x_i - \mu)^2$, where $\mu = \sum_i x_i / N$ • k-median clustering: • Given: N points $(x_1, \ldots, x_N)$ in a metric space • Find k points $C = \{c_1, c_2, \ldots, c_k\}$ that minimize $\sum_i d(x_i, C)$ (the assignment distance) • Clustering to be covered in detail in a future presentation
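To make the k-median objective concrete, the sketch below evaluates the assignment distance for a given set of centers. It assumes Euclidean distance as the metric (the slide only requires a metric space), and the function name `kmedian_cost` and the sample points are hypothetical.

```python
import math

def kmedian_cost(points, centers):
    """Sum over all points of the distance to the nearest center."""
    return sum(min(math.dist(p, c) for c in centers) for p in points)

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers = [(0, 0), (10, 10)]          # k = 2 candidate medians
cost = kmedian_cost(points, centers)  # each point is assigned to its closest center
```

This only scores a candidate solution; finding the centers that minimize the cost is the hard part that the clustering algorithms address.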

  34. Notation • Current window, size = N • Buckets: $B_m, B_{m-1}, \ldots, B_2, B_1$ • $V_i$ = variance of the i-th bucket • $n_i$ = number of elements in the i-th bucket • $\mu_i$ = mean of the i-th bucket

  35. Variance – composition • Bi,j = concatenation of buckets i and j
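The slide's formulas did not survive in this transcript. As a hedged illustration, the sketch below uses the standard identity for combining the statistics of two disjoint groups, with V defined as the sum of squared deviations as on the earlier slide: n = n_i + n_j, μ = (n_i μ_i + n_j μ_j) / n, and V = V_i + V_j + (n_i n_j / n)(μ_i − μ_j)². The function names `combine` and `stats` are assumptions for this example.

```python
import math

def stats(xs):
    """Return (n, mean, V) where V is the sum of squared deviations from the mean."""
    n = len(xs)
    mu = sum(xs) / n
    return n, mu, sum((x - mu) ** 2 for x in xs)

def combine(n_i, mu_i, v_i, n_j, mu_j, v_j):
    """Statistics of the concatenation B_{i,j} from the statistics of B_i and B_j."""
    n = n_i + n_j
    mu = (n_i * mu_i + n_j * mu_j) / n
    v = v_i + v_j + (n_i * n_j / n) * (mu_i - mu_j) ** 2
    return n, mu, v

a, b = [1.0, 2.0, 3.0], [10.0, 20.0]
merged = combine(*stats(a), *stats(b))   # computed from bucket summaries only
direct = stats(a + b)                    # computed from the raw elements
```

The point of the identity is that merging two buckets needs only their three summary values, never the original elements, which is what a stream synopsis requires.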

  36. Decision Making • The problem of handling time-changing data also had a significant influence on decision algorithms. • Pedro Domingos, who had originally developed a successful decision tree algorithm (VFDT) [DH00], also recognized the need to work with recent data, resulting in a new algorithm known as CVFDT [HSD01]. • VFDT - Very Fast Decision Tree • CVFDT - Concept-adapting Very Fast Decision Tree • Implemented a window approach

  37. Decision Making • Both VFDT and CVFDT make use of a statistical result known as the Hoeffding* bound • Used to estimate the minimum number of examples needed to make a decision at a node in a decision tree. • This is the key concept that makes these algorithms work. * W. Hoeffding, “Probability Inequalities for Sums of Bounded Random Variables”, Journal of the American Statistical Association, 1963

  38. Hoeffding Bound • Random variable a whose range is R • n independent observations of a; mean: ā • Hoeffding bound states: with probability 1 − δ, the true mean of a is at least ā − ε, where $\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$
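A small numeric sketch of the bound, assuming its usual form ε = sqrt(R² ln(1/δ) / (2n)) as used in [DH00]. The function names `hoeffding_epsilon` and `examples_needed`, and the sample values for R, δ and ε, are hypothetical.

```python
import math

def hoeffding_epsilon(value_range, delta, n):
    """Error margin after n observations, holding with probability 1 - delta."""
    return math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))

def examples_needed(value_range, delta, epsilon):
    """Smallest n for which the margin drops to epsilon or below."""
    return math.ceil(value_range ** 2 * math.log(1 / delta) / (2 * epsilon ** 2))

eps = hoeffding_epsilon(value_range=1.0, delta=1e-7, n=1000)
n_min = examples_needed(value_range=1.0, delta=1e-7, epsilon=0.1)
```

The margin shrinks with the square root of n and does not depend on the distribution of the observations, which is why it can be applied to a stream without modelling the data first.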

  39. Hoeffding Bound • Significance… • This estimate/bound is incorporated into an ID3-type decision tree, hence VFDT/CVFDT • The information gain is evaluated against ε: a node is split once the difference in gain between the two best attributes exceeds ε

  40. VFDT Algorithm

  41. VFDT Algorithm Results

  42. CVFDT vs. VFDT • CVFDT is an extension of VFDT that incorporates “windowing” • CVFDT concept: • Grow the tree as usual, but using a window of “w” elements. • Monitor changes in gain for attributes. • If the gains change, generate an alternate subtree with the new “best” attribute, but keep it in the background. • Replace the old subtree if the new subtree becomes more accurate.

  43. References • [BDMO03] B. Babcock, M. Datar, R. Motwani and L. O’Callaghan. “Maintaining Variance and k-Medians over Data Stream Windows”. ACM PODS, 2003. http://citeseer.nj.nec.com/591910.html http://www.stanford.edu/~babcock/papers/pods03.ppt • [DH00] P. Domingos and G. Hulten. “Mining High-Speed Data Streams”. ACM KDD, 2000. http://citeseer.nj.nec.com/domingos00mining.html • [HSD01] G. Hulten, L. Spencer and P. Domingos. “Mining Time-Changing Data Streams”. ACM KDD, 2001. http://citeseer.nj.nec.com/hulten01mining.html • [DGIM02] M. Datar, A. Gionis, P. Indyk and R. Motwani. “Maintaining Stream Statistics over Sliding Windows”. ACM-SIAM SODA, 2002. http://www.stanford.edu/~babcock/papers/pods03.ppt • [GGR02] M. Garofalakis, J. Gehrke and R. Rastogi. “Querying and Mining Data Streams: You Only Get One Look”. SIGMOD, 2002 (tutorial). http://www.bell-labs.com/user/minos/Talks/streams-tutorial02.ppt
