
Adaptive Processing in Data Stream Systems


Presentation Transcript


  1. Adaptive Processing in Data Stream Systems • Shivnath Babu • Stanford University • stanford stream data manager

  2. Data Streams • New applications -- data as continuous, rapid, time-varying data streams • Sensor networks, RFID tags • Network monitoring and traffic engineering • Financial applications • Telecom call records • Web logs and click-streams • Manufacturing processes • Traditional databases -- data stored in finite, persistent data sets

  3. Using Traditional Database [Diagram labels: User/Application, Query, Result, Loader, Table R, Table S]

  4. New Approach for Data Streams [Diagram labels: User/Application, Register Continuous Query, Result, Input streams, Stream Query Processor]

  5. Example Continuous Queries • Web • Amazon’s best sellers over last hour • Network Intrusion Detection • Track HTTP packets with destination address matching a prefix in given table and content matching “*\.ida” • Finance • Monitor NASDAQ stocks between $20 and $200 that have moved down more than 2% in the last 20 minutes

  6. Data Stream Management System (DSMS) [Diagram labels: Input Streams, Register Continuous Query, Streamed Result, Stored Result, Archive, Stored Tables]

  7. Primer on Database Query Processing [Diagram labels: Database System, Declarative Query, Preprocessing, Canonical form, Query Optimization, Best query execution plan, Query Execution, Data, Results]

  8. Traditional Query Optimization • Statistics Manager: Periodically collects statistics, e.g., data sizes, histograms • Optimizer: Finds “best” query plan to process this query • Executor: Runs chosen plan to completion [Diagram labels: Query; Which statistics are required; Estimated statistics; Chosen query plan; Data, auxiliary structures, statistics]

  9. Optimizing Continuous Queries is Challenging • Continuous queries are long-running • Stream properties can change while query runs • Data properties: value distributions • Arrival properties: bursts, delays • System conditions can change • Performance of a fixed plan can change significantly over time • Adaptive processing: use plan that is best for current conditions

  10. Roadmap • StreaMon: Our adaptive query processing engine • Adaptive ordering of commutative filters • Adaptive caching for multiway joins • Current and future work • Similar techniques apply to conventional databases

  11. Traditional Optimization → StreaMon • Traditional: Statistics Manager: Periodically collects statistics, e.g., table sizes, histograms • Traditional: Optimizer: Finds “best” query plan to process this query • Traditional: Executor: Runs chosen plan to completion • StreaMon: Profiler: Monitors current stream and system characteristics • StreaMon: Re-optimizer: Ensures that plan is efficient for current characteristics • StreaMon: Executor: Executes current plan on incoming stream tuples [Diagram labels: Query; Which statistics are required; Estimated statistics; Chosen query plan; Decisions to adapt; Combined in part for efficiency]

  12. Pipelined Filters • Commutative filters over a stream • Example: Track HTTP packets with destination address matching a prefix in given table and content matching “*\.ida” • Simple to complex filters • Boolean predicates • Table lookups • Pattern matching • User-defined functions [Diagram labels: Packets, Filter1, Filter2, Filter3, Bad packets]

  13. Pipelined Filters: Problem Definition • Continuous Query: F1 ∧ F2 ∧ … ∧ Fn • Plan: Tuples → F(1) → F(2) → … → F(n) • Goal: Minimize expected cost to process a tuple
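Editor's sketch (not from the slides): one way to make "expected cost to process a tuple" concrete is to average, over a sample of tuples, the cost of the filters each tuple actually visits before it is dropped. The function name `average_cost` and the sample representation (`drops[t][f]` is True when filter `f` would drop sample tuple `t`) are assumptions made for illustration.

```python
from typing import Sequence


def average_cost(ordering: Sequence[int],
                 costs: Sequence[float],
                 drops: Sequence[Sequence[bool]]) -> float:
    """Average per-tuple cost of running filters in the given order.

    ordering: filter indices in execution order, i.e. F(1), F(2), ..., F(n)
    costs:    costs[f] is the cost of evaluating filter f on one tuple
    drops:    drops[t][f] is True if filter f would drop sample tuple t
    """
    total = 0.0
    for row in drops:
        for f in ordering:
            total += costs[f]      # the tuple reaches filter f, so we pay for it
            if row[f]:
                break              # dropped: later filters are never evaluated
    return total / len(drops)


# Two filters: filter 0 is cheap, filter 1 is expensive.
sample = [[True, False], [False, True], [True, True], [False, False]]
cheap_first = average_cost([0, 1], [1.0, 4.0], sample)
expensive_first = average_cost([1, 0], [1.0, 4.0], sample)
```

Because the cost is measured on whole sample tuples rather than per-filter drop rates, this estimate also reflects correlations between filters.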

  14. Pipelined Filters: Example • Informal Goal: If tuple will be dropped, then drop it as cheaply as possible [Diagram: numbered input tuples passing through filters F1, F2, F3, F4 to output tuples]

  15. Why is Our Problem Hard? • Filter drop-rates and costs can change over time • Filters can be correlated • E.g., Protocol = HTTP and DestPort = 80

  16. Metrics for an Adaptive Algorithm • Speed of adaptivity • Detecting changes and finding new plan • Run-time overhead • Re-optimization, collecting statistics, plan switching • Convergence properties • Plan properties under stable statistics [Diagram labels: StreaMon: Profiler, Re-optimizer, Executor]

  17. Pipelined Filters: Stable Statistics • Assume statistics are not changing • Order filters by decreasing drop-rate/cost ratio [MS79,IK84,KBZ86,H94] • With correlations, the problem is NP-hard • Greedy algorithm: Use conditional statistics • F(1) has maximum drop-rate/cost ratio • F(2) has maximum drop-rate/cost ratio for tuples not dropped by F(1) • And so on
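Editor's sketch (not from the slides): the Greedy rule above can be illustrated on a fixed sample of tuples. At each step it picks the filter with the highest conditional drop-rate/cost ratio among the tuples that survived the filters already chosen. The name `greedy_order` and the sample layout (`drops[t][f]` is True when filter `f` drops tuple `t`) are assumptions for illustration only.

```python
from typing import List, Sequence


def greedy_order(costs: Sequence[float],
                 drops: Sequence[Sequence[bool]]) -> List[int]:
    """Order filters greedily by conditional drop-rate / cost."""
    remaining = list(range(len(costs)))
    survivors = list(drops)          # tuples not dropped by filters chosen so far
    order: List[int] = []
    while remaining:
        def ratio(f: int) -> float:
            if not survivors:
                return 0.0           # nothing left to drop; any order is fine
            dropped = sum(1 for row in survivors if row[f])
            return (dropped / len(survivors)) / costs[f]

        best = max(remaining, key=ratio)   # ties go to the lowest filter index
        order.append(best)
        remaining.remove(best)
        survivors = [row for row in survivors if not row[best]]
    return order


# Filter 1 only drops tuples that filter 0 already drops (correlated),
# so Greedy places filter 2 ahead of it once filter 0 has been chosen.
T, F = True, False
sample = [[T, T, F], [T, T, F], [T, F, F], [F, F, T], [F, F, T], [F, F, F]]
order = greedy_order([1.0, 1.0, 1.0], sample)
```

In this sample, filters 1 and 2 have the same unconditional drop rate, so an ordering based on independent statistics cannot tell them apart. The conditional statistics can.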

  18. Adaptive Version of Greedy • Greedy gives strong guarantees • 4-approximation, best poly-time approx. possible assuming P ≠ NP [MBM+05] • For arbitrary (correlated) characteristics • Usually optimal in experiments • Challenge: • Online algorithm • Fast adaptivity to Greedy ordering • Low run-time overhead → A-Greedy: Adaptive Greedy

  19. A-Greedy • Profiler: Maintains conditional filter drop-rates and costs over recent tuples • Re-optimizer: Ensures that filter ordering is Greedy for current statistics • Executor: Processes tuples with current Greedy ordering [Diagram labels: Which statistics are required; Estimated statistics; Changes in filter ordering; Combined in part for efficiency]

  20. A-Greedy’s Profiler • Responsible for maintaining current statistics • Filter costs • Conditional filter drop-rates: exponentially many! • Profile Window: Sampled statistics from which required conditional drop-rates can be estimated

  21. Profile Window [Diagram: example tuples, filters F1 to F4, and a profile window of 0/1 entries with one column per filter]

  22. Greedy Ordering Using Profile Window [Diagram: Matrix View over filters F1 to F4 → Greedy Ordering]
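Editor's sketch (not from the slides): a profile window can be pictured as a bounded buffer of sampled tuples, each stored as one bit per filter. The class below is a hypothetical illustration. The names `ProfileWindow`, `maybe_profile`, and `conditional_drop_rate`, the sampling scheme, and the convention that a stored bit is True when the filter drops the tuple are all assumptions, not the authors' implementation.

```python
import random
from collections import deque


class ProfileWindow:
    """Bounded window of sampled tuples, stored as per-filter drop bits."""

    def __init__(self, filters, capacity, sample_prob, rng=None):
        self.filters = filters                 # predicates: True means the tuple passes
        self.window = deque(maxlen=capacity)   # oldest profile rows fall out automatically
        self.sample_prob = sample_prob
        self.rng = rng or random.Random()

    def maybe_profile(self, tup):
        """With probability sample_prob, evaluate ALL filters on tup and record the bits."""
        if self.rng.random() >= self.sample_prob:
            return False
        self.window.append(tuple(not f(tup) for f in self.filters))
        return True

    def conditional_drop_rate(self, f, passed):
        """Estimate P(filter f drops a tuple | tuple passes every filter in `passed`)."""
        survivors = [row for row in self.window
                     if not any(row[p] for p in passed)]
        if not survivors:
            return 0.0
        return sum(row[f] for row in survivors) / len(survivors)


# Filter 0 passes even numbers, filter 1 passes multiples of 4.
pw = ProfileWindow([lambda x: x % 2 == 0, lambda x: x % 4 == 0],
                   capacity=4, sample_prob=1.0)
for value in range(1, 9):
    pw.maybe_profile(value)      # window ends up holding the rows for 5, 6, 7, 8
```

Storing raw bits instead of precomputed rates means any conditional drop rate can be estimated on demand, without tracking every combination of filters separately.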

  23. A-Greedy’s Re-optimizer • Maintains Matrix View over Profile Window • Easy to incorporate filter costs • Efficient incremental update • Fast detection/correction of changes in Greedy order • Details in [BMM+04]: “Adaptive Processing of Pipelined Stream Filters”, SIGMOD 2004
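Editor's sketch (not from the slides): the slides do not define the Matrix View, so the following is one plausible reading, stated as an assumption. Entry `view[i][j]` (for j ≥ i) counts the profile-window tuples that pass the first i filters of the current ordering and are dropped by the filter at position j. Under that reading, the ordering is Greedy when each diagonal entry, scaled by cost, is at least as large as every entry to its right. The sketch rebuilds the matrix from scratch, whereas the slide mentions incremental update. The names `matrix_view` and `first_violation` are hypothetical.

```python
def matrix_view(order, window):
    """view[i][j]: tuples passing order[0..i-1] that order[j] drops (j >= i)."""
    n = len(order)
    view = [[0] * n for _ in range(n)]
    for row in window:                      # row[f] is True if filter f drops the tuple
        for i in range(n):
            for j in range(i, n):
                if row[order[j]]:
                    view[i][j] += 1
            if row[order[i]]:
                break                       # tuple never reaches later positions
    return view


def first_violation(view, costs, order):
    """Return (i, j) where position j beats position i on drops/cost, else None."""
    n = len(order)
    for i in range(n):
        for j in range(i + 1, n):
            if view[i][j] / costs[order[j]] > view[i][i] / costs[order[i]]:
                return (i, j)
    return None


T, F = True, False
window = [[T, T, F], [T, T, F], [T, F, F], [F, F, T], [F, F, T], [F, F, F]]
costs = [1.0, 1.0, 1.0]
stale = matrix_view([0, 1, 2], window)      # ordering that is no longer Greedy
fixed = matrix_view([0, 2, 1], window)      # ordering after swapping positions 1 and 2
```

A re-optimizer built on this idea would respond to a violation `(i, j)` by moving the filter at position j up to position i and re-checking the rows below.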

  24. Next • Tradeoffs and variations of A-Greedy • Experimental results for A-Greedy

  25. Tradeoffs • Suppose: • Changes are infrequent • Slower adaptivity is okay • Want best plans at very low run-time overhead • Three-way tradeoff among speed of adaptivity, run-time overhead, and convergence properties • Spectrum of A-Greedy variants

  26. Variants of A-Greedy [Diagram labels: Profile Window, Matrix View]

  27. Variants of A-Greedy Matrix View

  28. Experimental Setup • Implemented A-Greedy, Sweep, Local-Swaps, and Independent in StreaMon • Studied convergence properties, run-time overhead, and adaptivity • Synthetic testbed • Can control stream data and arrival properties • DSMS server running on 700 MHz Linux machine, 1 MB L2 cache, 2 GB memory

  29. Converged Processing Rate [Plot series: Optimal-Fixed, Sweep, A-Greedy, Local-Swaps, Independent]

  30. Effect of Filter Drop-Rate [Plot series: Optimal-Fixed, Sweep, A-Greedy, Local-Swaps, Independent]

  31. Effect of Correlation [Plot series: Optimal-Fixed, Sweep, A-Greedy, Local-Swaps, Independent]

  32. Run-time Overhead

  33. Adaptivity [Plot: x-axis is progress of time (x1000 tuples processed); annotation: Permute selectivities here]

  34. Roadmap • StreaMon: Our adaptive processing engine • Adaptive ordering of commutative filters • Adaptive caching for multiway joins • Current and future work

  35. Stream Joins [Diagram labels: Sensor R, Sensor S, Sensor T, DSMS, observations in the last minute, join results]

  36. MJoins [VNB04] [Diagram labels: Window on R, Window on S, Window on T; pipelines of ⋈R, ⋈S, ⋈T operators]

  37. Excessive Recomputation in MJoins [Diagram labels: Window on R, Window on S, Window on T; ⋈T, ⋈R]

  38. Materializing Join Subexpressions [Diagram labels: Fully-materialized join subexpression; Window on R, Window on S, Window on T]

  39. Tree Joins: Trees of Binary Joins [Diagram labels: Fully-materialized join subexpression; Window on R, Window on S, Window on T; inputs R, S, T]

  40. Hard State Hinders Adaptivity [Diagram: two tree-join plans over R, S, T connected by a plan switch; labels WR, WS, WT]

  41. Can we get best of both worlds? • MJoin: Recomputation • Tree Join: Less adaptive • Tree Join: Higher memory use [Diagram: MJoin pipelines and a tree join over R, S, T]

  42. MJoins + Caches [Diagram labels: S tuple, Probe, Cache, WR, WT, ⋈R, ⋈T, Bypass pipeline segment; Window on R, Window on S, Window on T]

  43. MJoins + Caches (contd.) • Caches are soft state • Adaptive • Flexible with respect to memory usage • Captures whole spectrum from MJoins to Tree Joins and plans in between • Challenge: adaptive algorithm to choose join operator orders and caches in pipelines
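Editor's sketch (not from the slides): the idea that a cache is soft state can be illustrated with a toy three-way join in which an S tuple probes a cache of R ⋈ T results instead of re-running that pipeline segment. Everything here is an assumption made for brevity: the class name `CachedMJoin`, a single shared join key across R, S, and T, windows that never expire, and cache maintenance by invalidation rather than incremental update.

```python
from collections import defaultdict


class CachedMJoin:
    """Toy R ⋈ S ⋈ T equi-join on one key, with a cache used by S tuples."""

    def __init__(self):
        self.windows = {name: defaultdict(list) for name in "RST"}
        self.cache = {}        # key -> list of (r, t) pairs; soft state only
        self.hits = 0

    def process(self, stream, key, value):
        """Join a new tuple against the other two windows, then store it."""
        if stream == "S":
            pairs = self.cache.get(key)
            if pairs is None:              # miss: run the R and T pipeline segment
                pairs = [(r, t)
                         for r in self.windows["R"].get(key, [])
                         for t in self.windows["T"].get(key, [])]
                self.cache[key] = pairs
            else:                          # hit: bypass the pipeline segment
                self.hits += 1
            results = [(r, value, t) for r, t in pairs]
        else:
            self.cache.pop(key, None)      # cached R ⋈ T for this key is now stale
            other = "T" if stream == "R" else "R"
            results = []
            for s in self.windows["S"].get(key, []):
                for o in self.windows[other].get(key, []):
                    results.append((value, s, o) if stream == "R" else (o, s, value))
        self.windows[stream][key].append(value)
        return results

    def drop_caches(self):
        """Discard all cached state; results stay correct, only speed changes."""
        self.cache.clear()


join = CachedMJoin()
join.process("R", 1, "r1")
join.process("T", 1, "t1")
first = join.process("S", 1, "s1")     # cache miss, fills the cache
second = join.process("S", 1, "s2")    # cache hit
join.drop_caches()
third = join.process("S", 1, "s3")     # miss again, same answer as with the cache
```

Dropping the cache at any point changes only how much work the next probe does, never the join result. That is the property that lets an adaptive algorithm add or remove caches freely.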

  44. Adaptive Caching (A-Caching) • Adaptive join ordering with A-Greedy or variant • Join operator orders → candidate caches • Adaptive selection from candidate caches • Adaptive memory allocation to chosen caches

  45. A-Caching (caching part only) • Profiler: Estimates costs and benefits of candidate caches • Re-optimizer: Ensures that maximum-benefit subset of candidate caches is used • Executor: MJoins with caches [Diagram labels: List of candidate caches; Estimated statistics; Add/remove caches; Combined in part for efficiency]

  46. Performance of Stream-Join Plans (1) • Arrival rates of streams are in the ratio 1:1:1:10; other details of input are given in [BMW+05] [Diagram: join plan over streams R, S, T, U]

  47. Performance of Stream-Join Plans (2) • Arrival rates of streams are in the ratio 15:10:5:1; other details of input are given in [BMW+05]

  48. A-Caching: Results at a glance • Capture whole spectrum from fully-pipelined MJoins to tree-based joins adaptively • Approximation algorithms → scalable • Different types of caches • Up to 7x improvement with respect to MJoin and 2x improvement with respect to TreeJoin • Details in [BMW+05]: “Adaptive Caching for Continuous Queries”, ICDE 2005 (To appear)

  49. Current and Future Work • Broadening StreaMon’s scope, e.g., • Shared computation among multiple queries • Parallelism • Rio: Adaptive query processing in conventional database systems • Plan logging: A new overall approach to address certain “meta issues” in adaptive processing

  50. Related Work • Adaptive processing of continuous queries • E.g., Eddies [AH00], NiagaraCQ [CDT+00] • Adaptive processing in conventional databases • Inter-query adaptivity, e.g., Leo [SLM+01], [BC03] • Intra-query adaptivity, e.g., Re-Opt [KD98], POP [MRS+04] • New approaches to query optimization • E.g., parametric [GW89,INS+92,HS03], expected-cost based [CHS99,CHG02], error-aware [VN03]
