Adaptive Processing in Data Stream Systems

Shivnath Babu, Stanford University, stanfordstreamdatamanager

Data Streams • New applications -- data as continuous, rapid, time-varying data streams • Sensor networks, RFID tags • Network monitoring and traffic engineering • Financial applications • Telecom call records • Web logs and click-streams • Manufacturing processes • Traditional databases -- data stored in finite, persistent data sets

Using Traditional Database [Diagram: the User/Application sends queries to the database and receives results; a Loader populates Table R and Table S.]

New Approach for Data Streams [Diagram: the User/Application registers a continuous query with the Stream Query Processor, which reads the input streams and returns results.]

Example Continuous Queries • Web • Amazon’s best sellers over last hour • Network Intrusion Detection • Track HTTP packets with destination address matching a prefix in given table and content matching “*\.ida” • Finance • Monitor NASDAQ stocks between $20 and $200 that have moved down more than 2% in the last 20 minutes

Data Stream Management System (DSMS) [Diagram components: Register Continuous Query, Input Streams, Streamed Result, Stored Result, Archive, Stored Tables.]

Primer on Database Query Processing [Diagram, inside the Database System: Declarative Query → Preprocessing → Canonical form → Query Optimization → Best query execution plan → Query Execution (over Data) → Results.]

Traditional Query Optimization • Optimizer: Finds “best” query plan to process this query • Statistics Manager: Periodically collects statistics, e.g., data sizes, histograms • Executor: Runs chosen plan to completion • Diagram labels: Query; Which statistics are required; Estimated statistics; Data, auxiliary structures, statistics; Chosen query plan

Optimizing Continuous Queries is Challenging • Continuous queries are long-running • Stream properties can change while query runs • Data properties: value distributions • Arrival properties: bursts, delays • System conditions can change • Performance of a fixed plan can change significantly over time • Adaptive processing: use plan that is best for current conditions

Roadmap • StreaMon: Our adaptive query processing engine • Adaptive ordering of commutative filters • Adaptive caching for multiway joins • Current and future work • Similar techniques apply to conventional databases

Traditional Optimization → StreaMon • Statistics Manager: Periodically collects statistics, e.g., table sizes, histograms → Profiler: Monitors current stream and system characteristics • Optimizer: Finds “best” query plan to process this query → Re-optimizer: Ensures that plan is efficient for current characteristics • Executor: Runs chosen plan to completion → Executor: Executes current plan on incoming stream tuples • Diagram labels: Query; Which statistics are required; Estimated statistics; Chosen query plan; Decisions to adapt; Combined in part for efficiency

Pipelined Filters [Diagram: Packets → Filter1 → Filter2 → Filter3 → Bad packets] • Commutative filters over a stream • Example: Track HTTP packets with destination address matching a prefix in given table and content matching “*\.ida” • Simple to complex filters • Boolean predicates • Table lookups • Pattern matching • User-defined functions

Pipelined Filters: Problem Definition • Continuous Query: F1 ∧ F2 ∧ … ∧ Fn • Plan: Tuples → F(1) → F(2) → … → F(n) • Goal: Minimize expected cost to process a tuple
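The goal on this slide can be made concrete with a small sketch. It assumes each filter has a fixed per-tuple cost and that a sample of tuples is available; the expected cost of an ordering is then the average, over the sample, of the costs of the filters a tuple reaches before it is dropped. The names `expected_cost`, `filters`, `costs`, and `sample` are illustrative and do not come from the talk.

```python
from typing import Callable, Sequence

Filter = Callable[[int], bool]  # returns True if the tuple passes the filter


def expected_cost(order: Sequence[int],
                  filters: Sequence[Filter],
                  costs: Sequence[float],
                  sample: Sequence[int]) -> float:
    """Average per-tuple cost of running `filters` in the given `order`."""
    total = 0.0
    for tup in sample:
        for idx in order:
            total += costs[idx]        # the tuple reached this filter, so pay for it
            if not filters[idx](tup):  # dropped: later filters are never evaluated
                break
    return total / len(sample)


# Illustrative data (an assumption, not from the slides).
filters = [lambda x: x % 2 == 0,  # F1 drops odd values
           lambda x: x < 3]       # F2 drops values >= 3
costs = [1.0, 1.0]
sample = list(range(10))

print(expected_cost([0, 1], filters, costs, sample))  # 1.5
print(expected_cost([1, 0], filters, costs, sample))  # 1.3
```

Running the more selective filter first is cheaper here because more tuples are dropped before the second filter is ever evaluated.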

Pipelined Filters: Example [Figure: input tuples flow through filters F1, F2, F3, F4; each filter drops some tuples and the remaining tuples are output.] Informal Goal: If tuple will be dropped, then drop it as cheaply as possible

Why is Our Problem Hard? • Filter drop-rates and costs can change over time • Filters can be correlated • E.g., Protocol = HTTP and DestPort = 80

Metrics for an Adaptive Algorithm [Diagram: StreaMon's Profiler, Re-optimizer, and Executor] • Speed of adaptivity • Detecting changes and finding new plan • Run-time overhead • Re-optimization, collecting statistics, plan switching • Convergence properties • Plan properties under stable statistics

Pipelined Filters: Stable Statistics • Assume statistics are not changing • Order filters by decreasing drop-rate/cost [MS79,IK84,KBZ86,H94] • Correlations → NP-Hard • Greedy algorithm: Use conditional statistics • F(1) has maximum drop-rate/cost ratio • F(2) has maximum drop-rate/cost ratio for tuples not dropped by F(1) • And so on
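The Greedy rule above can be sketched as follows. This is a minimal offline version that assumes stable statistics, estimates the conditional drop-rates from a fixed sample of tuples, and treats filter costs as known constants; the function name `greedy_order` and the sample data are hypothetical.

```python
from typing import Callable, List, Sequence

Filter = Callable[[int], bool]  # returns True if the tuple passes the filter


def greedy_order(filters: Sequence[Filter],
                 costs: Sequence[float],
                 sample: Sequence[int]) -> List[int]:
    """Order filters by conditional drop-rate/cost, as in the Greedy rule."""
    remaining = list(range(len(filters)))
    survivors = list(sample)  # tuples not dropped by the filters chosen so far
    order: List[int] = []

    def ratio(i: int) -> float:
        # Drop-rate of filter i among the current survivors, per unit cost.
        if not survivors:
            return 0.0
        dropped = sum(1 for t in survivors if not filters[i](t))
        return (dropped / len(survivors)) / costs[i]

    while remaining:
        best = max(remaining, key=ratio)
        order.append(best)
        remaining.remove(best)
        # Condition the next choice on tuples that pass the chosen filter.
        survivors = [t for t in survivors if filters[best](t)]
    return order


# Illustrative filters (an assumption, not from the slides).
filters = [lambda x: x % 2 == 0,  # index 0
           lambda x: x < 3,       # index 1
           lambda x: x % 3 == 0]  # index 2
sample = list(range(10))

print(greedy_order(filters, [1.0, 1.0, 1.0], sample))   # [1, 2, 0]
print(greedy_order(filters, [1.0, 10.0, 1.0], sample))  # [2, 0, 1]
```

The second call shows the effect of cost: once the filter at index 1 becomes ten times as expensive, its drop-rate/cost ratio falls below the others and it moves to the end.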

Adaptive Version of Greedy • Greedy gives strong guarantees • 4-approximation, best poly-time approx. possible assuming P ≠ NP [MBM+05] • For arbitrary (correlated) characteristics • Usually optimal in experiments • Challenge: • Online algorithm • Fast adaptivity to Greedy ordering • Low run-time overhead → A-Greedy: Adaptive Greedy

A-Greedy • Profiler: Maintains conditional filter drop-rates and costs over recent tuples • Re-optimizer: Ensures that filter ordering is Greedy for current statistics • Executor: Processes tuples with current Greedy ordering • Diagram labels: Which statistics are required; Estimated statistics; Changes in filter ordering; Combined in part for efficiency

A-Greedy’s Profiler • Responsible for maintaining current statistics • Filter costs • Conditional filter drop-rates: exponential! • Profile Window: Sampled statistics from which required conditional drop-rates can be estimated
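A profile window of this kind might look like the sketch below. It is not the StreaMon implementation. It assumes that a sampled tuple is evaluated against every filter, that each stored row holds one boolean per filter that is True when that filter drops the tuple, and that the window keeps only the most recent rows. The class name `ProfileWindow` and the parameters `capacity` and `sample_prob` are hypothetical.

```python
import random
from collections import deque
from typing import Callable, Deque, Optional, Sequence, Tuple

Filter = Callable[[int], bool]  # returns True if the tuple passes the filter


class ProfileWindow:
    """Sliding window of sampled tuples, stored as per-filter drop bits."""

    def __init__(self, filters: Sequence[Filter], capacity: int,
                 sample_prob: float, rng: Optional[random.Random] = None):
        self.filters = list(filters)
        self.sample_prob = sample_prob
        self.rng = rng or random.Random()
        # deque(maxlen=...) evicts the oldest row automatically.
        self.rows: Deque[Tuple[bool, ...]] = deque(maxlen=capacity)

    def observe(self, tup: int) -> None:
        # Sample a fraction of tuples; a sampled tuple is run through all
        # filters (assumption) so any conditional drop-rate can be estimated.
        if self.rng.random() < self.sample_prob:
            self.rows.append(tuple(not f(tup) for f in self.filters))

    def conditional_drop_rate(self, candidate: int,
                              earlier: Sequence[int]) -> float:
        """Drop-rate of `candidate` among rows not dropped by `earlier`."""
        reached = [r for r in self.rows if not any(r[j] for j in earlier)]
        if not reached:
            return 0.0
        return sum(r[candidate] for r in reached) / len(reached)


filters = [lambda x: x % 2 == 0, lambda x: x < 3, lambda x: x % 3 == 0]
window = ProfileWindow(filters, capacity=10, sample_prob=1.0)
for value in range(10):
    window.observe(value)

print(window.conditional_drop_rate(1, []))   # 0.7
print(window.conditional_drop_rate(2, [1]))  # 2 of the 3 rows that pass filter 1
```

Storing drop bits rather than pre-computed rates avoids keeping an exponential number of conditional statistics: any conditional drop-rate that the re-optimizer asks for is derived from the same rows on demand.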

Profile Window [Figure: the example stream through filters F1, F2, F3, F4, with a profile window of sampled tuples holding one 0/1 entry per filter; entries as extracted: 1 0 0 1 0 0 1 1 1 0 0 1 0 1 1 0.]

Greedy Ordering Using Profile Window [Figure: Matrix View over filters F1, F2, F3, F4 → Greedy Ordering.]

A-Greedy’s Re-optimizer • Maintains Matrix View over Profile Window • Easy to incorporate filter costs • Efficient incremental update • Fast detection/correction of changes in Greedy order • Details in [BMM+04]: “Adaptive Processing of Pipelined Stream Filters”, SIGMOD 2004
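The slide does not define the Matrix View, so the sketch below rests on an assumed definition; [BMM+04] is the authoritative source. For the current ordering, entry `view[i][j]` with `j >= i` is taken to be the number of profile-window rows that the filter at position `j` drops, counted only over rows not dropped by the filters at positions before `i`. Under that assumption the ordering is Greedy when, in every row `i`, the diagonal entry has the largest count-to-cost ratio. The view is recomputed from scratch here for clarity, whereas the slide calls for incremental updates. The names `matrix_view` and `first_violation` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Row = Sequence[bool]  # one profile-window row: True where a filter drops the tuple


def matrix_view(rows: Sequence[Row], order: Sequence[int]) -> List[List[int]]:
    """Assumed Matrix View: conditional drop counts for the current ordering."""
    n = len(order)
    view = [[0] * n for _ in range(n)]
    for row in rows:
        for i in range(n):
            for j in range(i, n):
                if row[order[j]]:
                    view[i][j] += 1
            if row[order[i]]:
                break  # dropped at position i, so it is not counted in later rows
    return view


def first_violation(view: List[List[int]], order: Sequence[int],
                    costs: Sequence[float]) -> Optional[Tuple[int, int]]:
    """Return positions (i, j) where a later filter beats the one at i, else None."""
    n = len(order)
    for i in range(n):
        for j in range(i + 1, n):
            if view[i][j] / costs[order[j]] > view[i][i] / costs[order[i]]:
                return (i, j)
    return None


# Drop bits for values 0..9 under three illustrative filters:
# filter 0 drops odd values, filter 1 drops values >= 3,
# filter 2 drops values that are not multiples of 3.
rows = [(x % 2 != 0, x >= 3, x % 3 != 0) for x in range(10)]
costs = [1.0, 1.0, 1.0]

print(matrix_view(rows, [1, 2, 0]))                                # [[7, 6, 5], [0, 2, 1], [0, 0, 0]]
print(first_violation(matrix_view(rows, [1, 2, 0]), [1, 2, 0], costs))  # None
print(first_violation(matrix_view(rows, [0, 1, 2]), [0, 1, 2], costs))  # (0, 1)
```

A reported violation identifies the position at which the current ordering stops being Greedy, which is where a correction would start.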

Next • Tradeoffs and variations of A-Greedy • Experimental results for A-Greedy

Tradeoffs • Suppose: • Changes are infrequent • Slower adaptivity is okay • Want best plans at very low run-time overhead • Three-way tradeoff among speed of adaptivity, run-time overhead, and convergence properties • Spectrum of A-Greedy variants

Variants of A-Greedy [Figure: variants shown in terms of the Profile Window and Matrix View.]

Variants of A-Greedy (contd.) [Figure: variant shown in terms of the Matrix View.]

Experimental Setup • Implemented A-Greedy, Sweep, Local-Swaps, and Independent in StreaMon • Studied convergence properties, run-time overhead, and adaptivity • Synthetic testbed • Can control stream data and arrival properties • DSMS server running on 700 MHz Linux machine, 1 MB L2 cache, 2 GB memory

[Plot: Converged Processing Rate; series: Optimal-Fixed, Sweep, A-Greedy, Local-Swaps, Independent.]

[Plot: Effect of Filter Drop-Rate; series: Optimal-Fixed, Sweep, A-Greedy, Local-Swaps, Independent.]

[Plot: Effect of Correlation; series: Optimal-Fixed, Sweep, A-Greedy, Local-Swaps, Independent.]

Run-time Overhead

[Plot: Adaptivity; x-axis: progress of time (x1000 tuples processed); annotation: “Permute selectivities here”.]

Roadmap • StreaMon: Our adaptive processing engine • Adaptive ordering of commutative filters • Adaptive caching for multiway joins • Current and future work

Stream Joins [Diagram: Sensor R, Sensor S, and Sensor T stream observations into the DSMS, which joins the observations in the last minute and outputs join results.]

MJoins [VNB04] [Diagram: pipelines of join operators (⋈R, ⋈S, ⋈T) over the Window on R, Window on S, and Window on T.]

Excessive Recomputation in MJoins [Diagram: a pipeline with the ⋈T and ⋈R operators over the Window on R, Window on S, and Window on T.]

Materializing Join Subexpressions [Diagram: a fully-materialized join subexpression over the Window on R, Window on S, and Window on T.]

Tree Joins: Trees of Binary Joins [Diagram: a tree of binary joins over streams R, S, and T, with the Window on R, Window on S, Window on T, and a fully-materialized join subexpression.]

Hard State Hinders Adaptivity [Diagram: a plan switch between two tree-join plans over R, S, and T, one built on WR ⋈ WT and the other on WS ⋈ WT.]

Can we get best of both worlds? [Diagram: an MJoin plan and a Tree Join plan (with WR ⋈ WT) over R, S, and T.] • MJoin: Recomputation • Tree Join: Less adaptive, Higher memory use

MJoins + Caches [Diagram: an S tuple probes a Cache of WR ⋈ WT to bypass the pipeline segment ⋈T, ⋈R over the Window on R, Window on S, and Window on T.]

MJoins + Caches (contd.) • Caches are soft state • Adaptive • Flexible with respect to memory usage • Captures whole spectrum from MJoins to Tree Joins and plans in between • Challenge: adaptive algorithm to choose join operator orders and caches in pipelines
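The idea of a cache that lets a tuple bypass a pipeline segment can be sketched as below. This is a simplified illustration under stated assumptions, not the A-Caching implementation: all three streams join on one shared key, window expiry is omitted, and the invalidation policy (drop the cached entry whenever the window on R or T changes for that key) is an assumption. The class and method names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple


class CachedSPipeline:
    """Pipeline for new S tuples: S joins the window on R, then the window on T."""

    def __init__(self) -> None:
        self.win_r: Dict[Hashable, List[str]] = defaultdict(list)
        self.win_t: Dict[Hashable, List[str]] = defaultdict(list)
        # Soft state: key -> matching (r, t) pairs from the two windows.
        self.cache: Dict[Hashable, List[Tuple[str, str]]] = {}
        self.hits = 0
        self.misses = 0

    def insert_r(self, key: Hashable, r: str) -> None:
        self.win_r[key].append(r)
        self.cache.pop(key, None)  # cached pairs for this key are now stale

    def insert_t(self, key: Hashable, t: str) -> None:
        self.win_t[key].append(t)
        self.cache.pop(key, None)

    def drop_cache(self) -> None:
        # Always safe: results are recomputed from the windows on the next miss.
        self.cache.clear()

    def process_s(self, key: Hashable, s: str) -> List[Tuple[str, str, str]]:
        pairs = self.cache.get(key)
        if pairs is None:
            # Miss: run the pipeline segment against both windows.
            self.misses += 1
            pairs = [(r, t)
                     for r in self.win_r.get(key, [])
                     for t in self.win_t.get(key, [])]
            self.cache[key] = pairs
        else:
            # Hit: the join operators in the segment are bypassed.
            self.hits += 1
        return [(s, r, t) for r, t in pairs]


pipeline = CachedSPipeline()
pipeline.insert_r(1, "r1")
pipeline.insert_t(1, "t1")
pipeline.insert_t(1, "t2")
print(pipeline.process_s(1, "s1"))  # miss: computed from the windows
print(pipeline.process_s(1, "s2"))  # hit: served from the cache
print(pipeline.hits, pipeline.misses)
```

Because the cache can be emptied at any time without affecting correctness, a re-optimizer is free to add or remove caches and change their memory allocation, which is the soft-state property the slide contrasts with materialized subexpressions in tree joins.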

Adaptive Caching (A-Caching) • Adaptive join ordering with A-Greedy or variant • Join operator orders → candidate caches • Adaptive selection from candidate caches • Adaptive memory allocation to chosen caches

A-Caching (caching part only) • Profiler: Estimates costs and benefits of candidate caches • Re-optimizer: Ensures that maximum-benefit subset of candidate caches is used • Executor: MJoins with caches • Diagram labels: List of candidate caches; Estimated statistics; Add/remove caches; Combined in part for efficiency

Performance of Stream-Join Plans (1) [Figure: join plan over streams R, S, T, and U.] Arrival rates of the streams are in the ratio 1:1:1:10; other details of the input are given in [BMW+05]

Performance of Stream-Join Plans (2) Arrival rates of the streams are in the ratio 15:10:5:1; other details of the input are given in [BMW+05]

A-Caching: Results at a glance • Capture whole spectrum from Fully-pipelined MJoins to Tree-based joins adaptively • Approximation algorithms → scalable • Different types of caches • Up to 7x improvement with respect to MJoin and 2x improvement with respect to TreeJoin • Details in [BMW+05]: “Adaptive Caching for Continuous Queries”, ICDE 2005 (To appear)

Current and Future Work • Broadening StreaMon’s scope, e.g., • Shared computation among multiple queries • Parallelism • Rio: Adaptive query processing in conventional database systems • Plan logging: A new overall approach to address certain “meta issues” in adaptive processing

Related Work • Adaptive processing of continuous queries • E.g., Eddies [AH00], NiagaraCQ [CDT+00] • Adaptive processing in conventional databases • Inter-query adaptivity, e.g., Leo [SLM+01], [BC03] • Intra-query adaptivity, e.g., Re-Opt [KD98], POP [MRS+04] • New approaches to query optimization • E.g., parametric [GW89,INS+92,HS03], expected-cost based [CHS99,CHG02], error-aware [VN03]
