Run-Time Operator State Spilling for Memory Intensive Long-Running Queries

Bin Liu, Yali Zhu and Elke A. Rundensteiner
Database Systems Research Laboratory, Worcester Polytechnic Institute
SIGMOD'06

Motivating Example : Decision-Making Applications
• Real-time data integration server receives stock prices, volumes, ... and reviews, external reports, news, ...
• Decision support system: analyze the relationship among stock prices, reports, and news?
• An equi-join of stock prices, reports, and news on stock symbols
• Complex queries such as multi-joins are common!
• Produce as many results as possible at run-time (e.g., 9:00am-4:00pm)
• Require complete query results (e.g., for offline analysis after 4:00pm)

Challenges
• As many run-time results as possible
  • Demands main-memory-based query processing
• Push-based processing with complex queries
  • Demands main memory space to store operator states
  • Operator states may monotonically increase over time
• Run-time main memory overflow?
[Figure: join of streams A and B with operator states State A and State B]

Problem : Memory Overflow
• High demand on main memory:
  • High input rates and large windows result in huge states
  • Bursty streams cause temporary accumulation of tuples
  • Long-running queries exhibit monotonic state increases
• Potential solutions:
  • Query optimization
  • Distributed processing
  • Load shedding
  • Memory management

State Spill
• Push operator states temporarily to disk
• Spilled operator states are temporarily inactive
• New incoming tuples are processed only against partial states
[Figure: states of inputs A, B, C, with part of each state moved to secondary storage]

State of the Art : State Flushing
• XJoin [UF00]
  • Three-stage processing: hash
• Hash-Merge Join [MLA04]
  • Two algorithms: hash + merge
• Flux [SHCF03]
  • Single-input, distributed environment
• Observation: single-operator focus!!!

Problem : What about Multi-Operator Plans?
• Observation:
  • Interdependency among pipelined operators
  • Spilling of bottom operators affects their downstream operators!
• Maximize run-time throughput of Join1!!
  • Increases memory consumption of Join2:
    • May quickly fill main memory
    • May require state spill again
    • Causes more work downstream
  • But states in Join2 may not contribute to final output:
    • Low selectivity
[Figure: Join1 over inputs A, B, C feeding Join2 with input D]

Outline
• Basics on State Spill
• Plan-level Spill Strategies
• Experimental Evaluation

Granularity : State Partitioning
• Divide input streams into a large number of partitions
• At run-time, only need to choose partitions to spill [DNS92, SH03]
  • Avoids expensive run-time repartitioning
  • Does not affect partitions that are not spilled
• Example:
  • 300 partitions
  • M1 has odd IDs
  • M2 has even IDs
[Figure: inputs A, B, C each pass through a Split operator that routes partitions to Join instances on m1 and m2]
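The partitioning example can be sketched as follows. This is a minimal illustration, not the CAPE implementation: the modulo hash and the names `partition_id` and `machine_for` are assumptions; only the 300 partitions and the odd/even placement come from the slide.

```python
NUM_PARTITIONS = 300  # from the slide's example


def partition_id(join_value: int, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a join-column value to a partition ID in 1..num_partitions.

    A plain modulo is assumed here; any deterministic hash would do,
    as long as every input stream uses the same one.
    """
    return join_value % num_partitions + 1


def machine_for(pid: int) -> str:
    """Odd partition IDs live on M1, even IDs on M2 (as in the example)."""
    return "M1" if pid % 2 == 1 else "M2"
```

Because all inputs share the same split function, tuples from A, B, and C with equal join values land in partitions with the same ID, so a spill decision only has to name partition IDs.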

Partition Granularity : Choose State?
• Multiple states exist from different inputs
• Select states from one input only, to disk ...
• Select states with the same ID, to disk ...
  • Avoids across-machine processing
  • Simplifies spill management
  • Streamlines the cleanup process
• Partition group granularity!

Clean Up Stage
• Partition groups could be pushed multiple times:
  $(P_{A1}^{0}, P_{B1}^{0}, P_{C1}^{0}), (P_{A1}^{1}, P_{B1}^{1}, P_{C1}^{1}), (P_{A1}^{2}, P_{B1}^{2}, P_{C1}^{2}), \ldots, (P_{A1}^{k}, P_{B1}^{k}, P_{C1}^{k})$
• The results that have already been generated:
  $V_0 = P_{A1}^{0} \bowtie P_{B1}^{0} \bowtie P_{C1}^{0}$
  $V_1 = P_{A1}^{1} \bowtie P_{B1}^{1} \bowtie P_{C1}^{1}$
  ...
  $V_k = P_{A1}^{k} \bowtie P_{B1}^{k} \bowtie P_{C1}^{k}$
• Incremental view maintenance algorithm [ZMH+95]
  • Treat the multiple join as a materialized view
  • Partition groups as source updates

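Merge Disk Resident States
• To merge two partition groups with the same ID
  • i.e., $(P_{A1}^{0}, P_{B1}^{0}, P_{C1}^{0})$ and $(P_{A1}^{1}, P_{B1}^{1}, P_{C1}^{1})$
  • $V_0 = P_{A1}^{0} \bowtie P_{B1}^{0} \bowtie P_{C1}^{0}$, $V_1 = P_{A1}^{1} \bowtie P_{B1}^{1} \bowtie P_{C1}^{1}$
• After merge
  • Combined states: $P_{A1}^{0} \cup P_{A1}^{1}$, $P_{B1}^{0} \cup P_{B1}^{1}$, $P_{C1}^{0} \cup P_{C1}^{1}$
  • Final result: $V = (P_{A1}^{0} \cup P_{A1}^{1}) \bowtie (P_{B1}^{0} \cup P_{B1}^{1}) \bowtie (P_{C1}^{0} \cup P_{C1}^{1})$
  • Missing results: $\Delta V = V - V_0 - V_1$
  • $V - V_0 = (P_{A1}^{1} \bowtie P_{B1}^{0} \bowtie P_{C1}^{0}) \cup ((P_{A1}^{0} \cup P_{A1}^{1}) \bowtie P_{B1}^{1} \bowtie P_{C1}^{0}) \cup ((P_{A1}^{0} \cup P_{A1}^{1}) \bowtie (P_{B1}^{0} \cup P_{B1}^{1}) \bowtie P_{C1}^{1})$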
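The merge expansion can be checked on a toy partition group. The sketch below is only an illustration under stated assumptions: tuples are `(key, payload)` pairs, results are held as Python sets (so it assumes no duplicate tuples), and `join3` is a hypothetical helper, not the cleanup code of the system.

```python
from itertools import product


def join3(a, b, c):
    """Three-way equi-join on the first field of each (key, payload) tuple."""
    return {(x, y, z) for x, y, z in product(a, b, c) if x[0] == y[0] == z[0]}


# Two spilled generations (superscripts 0 and 1) of the same partition group.
a0, b0, c0 = [(1, "a1")], [(1, "b1")], [(1, "c1")]
a1, b1, c1 = [(1, "a2")], [(1, "b2")], []

v0 = join3(a0, b0, c0)                 # results produced before the first spill
v1 = join3(a1, b1, c1)                 # results produced before the second spill
v = join3(a0 + a1, b0 + b1, c0 + c1)   # full result over the merged states

# V - V0, written as the three-term expansion from the slide.
v_minus_v0 = (
    join3(a1, b0, c0)
    | join3(a0 + a1, b1, c0)
    | join3(a0 + a1, b0 + b1, c1)
)

# Missing results still to be produced in the cleanup stage.
missing = v_minus_v0 - v1
```

On this data the expansion agrees with the direct set difference, which is the property the cleanup stage relies on.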

State Spill Strategies

Which Partitions to Push?
• Throughput-oriented state spill
• Productivity of a partition group:
  • Poutput: number of output tuples generated from the partition group
  • Psize: size of the partition group in number of tuples
  • Productivity: Poutput/Psize
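The productivity metric is simple enough to state in code. In this sketch the class name `PartitionGroup` and its fields are hypothetical; treating an empty group as infinitely productive is also an assumption, made so that a group whose spill would free no memory is never picked.

```python
from dataclasses import dataclass


@dataclass
class PartitionGroup:
    group_id: int
    p_output: int  # Poutput: output tuples generated from this group so far
    p_size: int    # Psize: tuples currently held by this group

    @property
    def productivity(self) -> float:
        """Poutput / Psize; empty groups are never worth spilling."""
        if self.p_size == 0:
            return float("inf")
        return self.p_output / self.p_size
```

For example, a group holding 10 tuples that has produced 20 outputs has productivity 2.0, while one holding 10 tuples with 5 outputs has 0.5 and is the better spill candidate.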

Direct Extension : Local Output Method
• Globally choose partition groups
  • Rank partitions based on productivity: Poutput/Psize
  • Choose globally least productive partitions to spill
[Figure: plan with Join1 over A, B, C; Join2 adding D; Join3 adding E; states spilled to disk]
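A minimal sketch of the global ranking step, assuming each partition group is described by a tuple `(operator, group_id, p_output, p_size)` and that a spill should cover a given fraction of all held tuples. The function name and the stopping rule are assumptions for illustration.

```python
def choose_groups_to_spill(groups, fraction):
    """Pick the least productive groups across all operators.

    groups:   list of (operator, group_id, p_output, p_size)
    fraction: share of all held tuples to spill, e.g. 0.3
    """
    candidates = [g for g in groups if g[3] > 0]  # empty groups free nothing
    target = fraction * sum(g[3] for g in candidates)
    ranked = sorted(candidates, key=lambda g: g[2] / g[3])  # Poutput / Psize

    chosen, spilled = [], 0
    for g in ranked:
        if spilled >= target:
            break
        chosen.append(g)
        spilled += g[3]
    return chosen


groups = [
    ("Join1", 1, 20, 10),
    ("Join1", 2, 5, 10),
    ("Join2", 1, 1, 10),
    ("Join2", 2, 30, 10),
]
to_spill = choose_groups_to_spill(groups, 0.3)
```

Here the two least productive groups are selected regardless of which operator owns them; in the local output method, Poutput counts only the operator's own output.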

Bottom Up Pushing Strategy
• Spill states from bottom operators first
  • Choose partitions from Join1 until it reaches threshold k%
  • If not done, choose partitions from Join2, and so on
• Partition selection: randomly or using local productivity
• Minimize intermediate results in upstream operators (memory)
• Minimize number of state spill processes
• Fewer spill processes → higher overall query throughput?
[Figure: plan with Join1 over A, B, C; Join2 adding D; Join3 adding E]
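The bottom-up order can be sketched as below. The plan representation, the function name, and picking groups in list order within an operator are assumptions; the slide allows random selection or local productivity instead.

```python
def bottom_up_spill(plan, fraction):
    """Spill from the bottom operator upward until the threshold is met.

    plan:     list of (operator, [(group_id, p_size), ...]), bottom first
    fraction: share of all held tuples to spill (the k% threshold)
    """
    total = sum(size for _, groups in plan for _, size in groups)
    target = fraction * total

    chosen, spilled = [], 0
    for operator, groups in plan:
        for group_id, size in groups:
            if spilled >= target:
                return chosen
            chosen.append((operator, group_id))
            spilled += size
    return chosen


plan = [
    ("Join1", [(1, 10), (2, 10)]),
    ("Join2", [(1, 10), (2, 10)]),
    ("Join3", [(1, 10)]),
]
picked = bottom_up_spill(plan, 0.5)
```

With a 50% threshold, Join1 is drained completely before any Join2 group is touched.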

Partition Interdependency
• A smaller number of spill processes does not imply high throughput!!
• A partition pushed in a bottom operator may be the parent of productive partitions in its downstream operators
• It may be worthwhile to push P21 instead of P11!
• Global strategy: account for dependency relationships!
[Figure: partitions p11, p12 in OP1 feeding partitions p21, p22 in OP2, annotated with tuple counts]

“True” Global Output Strategy
• Poutput: contribution to final query output
• Employ lineage tracing algorithm to update Poutput statistics
  • Update Poutput values of partitions in Join3
  • Apply Split2 to each tuple, find the corresponding partitions in Join2, and update their Poutput values
  • And so on ...
[Figure: plan with split operators: SplitA, SplitB, SplitC feed Join1; Split1 and SplitD feed Join2; Split2 and SplitE feed Join3]
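The tracing step can be sketched as follows, assuming each final output tuple still carries the join-column values C1, C2, and D2 of the experimental query (Join1: A1=B1=C1, Join2: C2=D1, Join3: D2=E1). The modulo `split` function and the dictionary layout are assumptions for illustration, not the lineage tracing algorithm of the paper.

```python
from collections import Counter

NUM_PARTITIONS = 300


def split(value, n=NUM_PARTITIONS):
    """Assumed split function: partition ID from a join-column value."""
    return value % n


def trace_output(final_tuples):
    """Credit every final output tuple to the partition it came from in each join."""
    p_output = {"Join1": Counter(), "Join2": Counter(), "Join3": Counter()}
    for t in final_tuples:
        p_output["Join3"][split(t["D2"])] += 1  # partition that emitted the tuple
        p_output["Join2"][split(t["C2"])] += 1  # traced back through Split2
        p_output["Join1"][split(t["C1"])] += 1  # traced back through Split1
    return p_output


stats = trace_output([
    {"C1": 1, "C2": 7, "D2": 3},
    {"C1": 1, "C2": 8, "D2": 3},
])
```

After tracing, Poutput of a Join1 partition counts final query results rather than Join1's local output.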

Global Output with Penalty
• Incorporate intermediate result sizes
  • P11: Psize = 10, Poutput = 20
  • P12: Psize = 10, Poutput = 20
• Intermediate result factor Pinter
• Productivity value: Poutput/(Psize + Pinter)
[Figure: partitions p11, p12 in OP1 feeding partitions p2i, p2j in OP2, annotated with intermediate result counts]
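A small worked example of the penalized productivity value. Psize = 10 and Poutput = 20 are taken from the slide; the Pinter values 2 and 20 are assumed for illustration, since the figure's annotations did not survive extraction.

```python
def penalized_productivity(p_output, p_size, p_inter):
    """Poutput / (Psize + Pinter)."""
    return p_output / (p_size + p_inter)


# Same Psize and Poutput for both partitions (from the slide);
# the Pinter values below are hypothetical.
p11 = penalized_productivity(p_output=20, p_size=10, p_inter=2)
p12 = penalized_productivity(p_output=20, p_size=10, p_inter=20)

spill_first = "P12" if p12 < p11 else "P11"
```

Without the penalty the two partitions tie at 20/10; with it, the one that causes more intermediate results ranks lower and becomes the preferred spill candidate.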

Global Penalty : Tracing Pinter
• Penalty Pinter: contribution to intermediate result sizes
• Apply a similar lineage tracing algorithm for Pinter
[Figure: partitions p11, p12 in OP1 through p41, p4j in OP4, annotated with the counts 2, 3, 4 and the sums 2+3+4, 3+4, 4]
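One reading of the sums 2+3+4, 3+4, and 4 on this slide is that a partition's Pinter accumulates the intermediate results it causes in every operator downstream of it. The sketch below encodes only that assumed reading for a single chain of partitions; `trace_p_inter` and `hop_sizes` are hypothetical names.

```python
def trace_p_inter(hop_sizes):
    """Accumulate downstream intermediate result sizes along one chain.

    hop_sizes[i] is the number of intermediate tuples that the partition in
    operator i contributes to operator i + 1. The result has one Pinter value
    per operator on the chain; the last operator has nothing downstream.
    """
    p_inter = []
    running = 0
    for size in reversed(hop_sizes):
        running += size
        p_inter.append(running)
    p_inter.reverse()
    return p_inter + [0]


penalties = trace_p_inter([2, 3, 4])  # OP1 -> OP2 -> OP3 -> OP4
```

Under this reading the OP1 partition carries 2+3+4, the OP2 partition 3+4, the OP3 partition 4, and the OP4 partition none.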

CAPE System Overview [LZ+05, TLJ+05]
• CAPE: Continuous Query Processing Engine
[Figure: architecture diagram with components Distribution Manager, Query Processor, Connection Manager, Local Statistics Gatherer, Local Adaptation Controller, Global Adaptation Controller, Query Plan Manager, Runtime Monitor, Repository, Data Distributor, Data Receiver, Streaming Data Network, Stream Generator, Application Server, End User]

Experimental Setup : Queries and Data
• Inputs: A, B, C, D, and E data streams
• Query: Join1: A1=B1=C1, Join2: C2=D1, Join3: D2=E1
• Query operators: symmetric hash join
• Each input stream is partitioned into 300 partitions
• Query is partitioned and run on two machines
• Memory threshold for spill: 60MB
• Push 30% of states in each state spill
• Average tuple inter-arrival time of 50ms for each input
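For readers unfamiliar with the operator named above, here is a minimal sketch of a symmetric hash join. It is a binary simplification (Join1 in the query is a three-way join), and the class and method names are assumptions, not CAPE code.

```python
from collections import defaultdict


class SymmetricHashJoin:
    """Binary symmetric hash join over inputs "A" and "B"."""

    def __init__(self):
        # One hash table (operator state) per input, keyed by join value.
        self.state = {"A": defaultdict(list), "B": defaultdict(list)}

    def insert(self, side, key, payload):
        """Store the new tuple, probe the opposite state, return new results."""
        other = "B" if side == "A" else "A"
        self.state[side][key].append(payload)
        matches = self.state[other].get(key, [])
        if side == "A":
            return [(payload, m) for m in matches]
        return [(m, payload) for m in matches]

    def size(self):
        """Number of tuples currently held in both states."""
        return sum(len(v) for table in self.state.values() for v in table.values())
```

Every arriving tuple is kept, so without windows or spilling the state grows with the input, which is the memory pressure the spill strategies address.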

Experimental Setup
• High-performance PC cluster
  • Dual 2.4GHz CPUs, 2GB memory, Gigabit network
• 3 machines for Stream Generator, Application Server, and Distribution Manager
• Each Query Processor on a separate machine
• Generated data streams with integer join column values
  • Data value V appears R times for every K input tuples
  • Tuple range: K
  • Range join ratio: R
  • Average join rate: average number of tuples with the same join value per input
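One way to generate such a stream is sketched below. It assumes that K is divisible by R, that each block of K tuples holds K/R distinct values repeated R times, and that order within a block is random; the slides do not specify the generator, so `generate_window` is purely illustrative.

```python
import random


def generate_window(k, r, start=0, seed=None):
    """Return k join values in which each distinct value appears exactly r times."""
    if k % r != 0:
        raise ValueError("k must be divisible by r in this sketch")
    rng = random.Random(seed)
    values = [start + i for i in range(k // r) for _ in range(r)]
    rng.shuffle(values)
    return values


window = generate_window(k=30, r=3, seed=0)
```

A full stream would concatenate such windows, advancing `start` so that consecutive tuple ranges use fresh values.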

Amount of State Pushed Each Adaptation
• Percentage: # of tuples pushed / total # of tuples
• Settings: input rate 30ms/input, tuple range 30K, join ratio 3, adaptation threshold 200MB
[Figure panels: Run-Time Query Throughput; Run-Time Main Memory Usage. Legend: Percentage Spilled per Adaptation]

Experiment : Throughput & Memory
• Query with average join rate: Join1: 3, Join2: 1, Join3: 1

Experiment : Throughput Comparison
• Query with average join rate: Join1: 1, Join2: 3, Join3: 3
• Query with average join rate: Join1: 3, Join2: 2, Join3: 3

Experimental Summary
• Productivity metric improves run-time throughput
• Global output with penalty is the overall winner
• Global output (with and without penalty) outperforms the alternatives in run-time throughput
• Global output (with and without penalty) has similar (good) cleanup costs
• Bottom-up strategy has the lowest # of adaptations, yet performs poorly and has high cleanup costs

Conclusions
• Identified the problem of plan-level spill
  • State spill using “productivity” is viable
• Proposed plan-level spill policies
  • Dependencies considered for multi-operator plans
• Evaluated spill policies
  • Global spill solutions improve throughput

Thank You! Questions?

Acknowledgments
• DSRG students contributed to the CAPE code base, including Luping Ding, Bin Liu, Tim Sutherland, Brad Pielech, Rimma Nehme, Mariana Jbantova, Brad Momberger, Song Wang, Natasha Bogdanova
• Thanks to the National Science Foundation for partial support via IDM and equipment grants, to WPI for the RDC grant, and to NEC for student support
