Download Presentation
## Adaptive Query Processing

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -

**Adaptive Query Processing**Amol Deshpande, University of Maryland Joseph M. Hellerstein, University of California, Berkeley Vijayshankar Raman, IBM Almaden Research Center**Outline**• 20th Century Adaptivity: Intuition from the Classical Systems • Adaptive Selection Ordering • Adaptive Join Processing • Research Roundup**Outline**• 20th Century Adaptivity: Intuition from the Classical Systems • Data Independence and Adaptivity • The Adaptivity Loop • Two case studies • System R • INGRES • Tangential topics • Adaptive Selection Ordering • Adaptive Join Processing • Research Roundup**Data Independence Redux**dapp denv << dt dt • The taproot of modern database technology • Separation of specification (“what”) from implementation (“how”) • Refamiliarizing ourselves: Why do we care about data independence?**D. I. Adaptivity**• Query Optimization: the key to data independence • bridges specification and implementation • isolates static applications from dynamic environments • How does a DBMS account for dynamics in the environment? • This tutorial is on a 30-year-old topic • With a 21st-Century renaissance ADAPTIVITY**Why the Renaissance?**? denv dt • Breakdown of traditional query optimization • Queries over many tables • Unreliability of traditional cost estimation • Success & maturity make problems more apparent, critical • c.f. Oracle v6! • Query processing in new environments • E.g. data integration, web services, streams, P2P, sensornets, hosting, etc. • Unknown and dynamic characteristics for data and runtime • Increasingly aggressive sharing of resources and computation • Interactivity in query processing • Note two separate themes? • Unknowns: even static properties often unknown in new environments • and often unknowable a priori • Dynamics: can be very high -- motivates intra-query adaptivity**The Adaptivity Loop**Measure/Model Actuate Plan Need not happen at the same timescales!**An example query**Student • SQL: SELECT * FROM Professor P, Course C, Student S WHERE P.pid = C.pid AND S.sid = C.sid • QUEL: range of P is Professor range of C is Course range of S is Student RETRIEVE (P.ALL, C.ALL, S.ALL) WHERE P.pid = C.pid AND S.sid = C.sid Professor Course**System R Optimizer**cardinalitiesindex lo/hi key Dynamic Programming > UPDATE STATISTICS ❚ > SELECT * FROM ... ❚**System R Adaptivity**Measure/Model Actuate Plan Note different timescales**INGRES “Query Decomposition” 1**S P C > RANGE OF P IS ... ❚ OVQP(selections) hashed temps c > RANGE OF C_T IS … WHERE C_T.pid=44…❚ OVQP hashed temps**INGRES “Query Decomposition” 1**S P C > RANGE OF P IS ... ❚ OVQP(selections) hashed temps c > RANGE OF C_T IS … WHERE C_T.pid=44…❚ OVQP hashed temps > RANGE OF T_T IS … WHERE T_T.sid = 273❚ OVQP output tuples**INGRES “Query Decomposition” 2**S P C > RANGE OF P IS ... ❚ OVQP(selections) hashed temps > RANGE OF CT IS … WHERE CT.pid=26…❚ OVQP hashed temps > RANGE OF ST IS … WHERE ST.sid=441❚ OVQP output tuples**INGRES: Post-Mortem**Hash Hash P Hash Hash INL INL C Hash Hash INL S Hash Hash P C INL S • Case 1:P, S, PC • Case 2: P, PC, S Plan choice determined by number of C matches per P Each P tuple either Type1 or Type2.**Horizontal Partitioning**• “Post-mortem” behavior • Horizontal partitioning of inputs into different static plans [Ives02] • “Driving” input relation effectively partitioned by join keys • Each partition participates in a different static plan • Recurses up each different join tree • End result can be described as a union of static plans over partitions • In general, many such plans! • Note: post-mortem always has a relational description of some sort • But often “unusual”: plans that are simply not considered in System R! • Often cannot know the partitioning prior to query execution • So: plan-space and adaptivity loop settings have strong interactions! • A theme we’ll see throughout.**INGRES Adaptivity**Measure/Model Actuate Plan All ‘round the loop each time…**Observations on 20thC Systems**• Both INGRES & System R used adaptive query processing • To achieve data independence • They “adapt” at different timescales • Ingres goes ‘round the whole loop many times per query • System R decouples parts of loop, and is coarser-grained • measurement/modeling: periodic • planning/actuation: once per query • Query Post-Mortem reveals different relational plan spaces • System R is direct: each query mapped to a single relational algebra stmt • Ingres’ decision space generates a union of plans over horizontal partitions • this “super-plan” not materialized -- recomputed via FindMin • Both have zero-overhead actuation • Never waste query processing work**20th Century Summary**• System R’s optimization scheme deemed the winner for 25 years • Nearly all 20thC research varied System R’s individual steps • More efficient measurement (e.g. sampling) • More efficient/effective models (samples, histograms, sketches) • Expanded plan spaces (new operators, bushy trees, richer queries and data models, materialized views, parallelism, remote data sources, etc) • Alternative planning strategies (heuristic and enumerative) • Speaks to the strength of the scheme • independent innovation on multiple fronts • as compared with tight coupling of INGRES • But… minimal focus on the interrelationship of the steps • Which, as we saw from Ingres, also affects the plan space**21st Century Adaptive Query Processing**• (well, starts in late 1990’s) • Revisit basic architecture of System R • In effect, change the basic adaptivity loop! • As you examine schemes, keep an eye on: • Rate of change in the environment that is targeted • How radical the scheme is wrt the System R scheme • ease of evolutionary change • Increase in plan space: are there new, important opportunities? • even if environment is ostensibly static! • New overheads introduced • How amenable the scheme is to independent innovation at each step • Measure/Analyze/Plan/Actuate**Tangentially Related Work**• An incomplete list!!! • Competitive Optimization [Antoshenkov93] • Choose multiple plans, run in parallel for a time, let the most promising finish • 1x feedback: execution doesn’t affect planning after the competition • Parametric Query Optimization [INSS92, CG94, etc.] • Given partial stats in advance. Do some planning and prune the space. At runtime, given the rest of statistics, quickly finish planning. • Changes interaction of Measure/Model and Planning • No feedback whatsoever, so nothing to adapt to! • “Self-Tuning”/“Autonomic” Optimizers [CR94, CN97, BC02, etc.] • Measure query execution (e.g. cardinalities, etc.) • Enhances measurement, on its own doesn’t change the loop • Consider building non-existent physical access paths (e.g. indexes, partitions) • In some senses a separate loop – adaptive database design • Longer timescales**Tangentially Related Work II**• Robust Query Optimization [CHG02, MRS+04, BC05, etc.] • Goals: • Pick plans that remain predictable across wide ranges of scenarios • Pick least expected cost plan • Changes cost function for planning, not necessarily the loop. • If such functions are used in adaptive schemes, less fluctuation [MRS+04] • Hence fewer adaptations, less adaptation overhead • Adaptive query operators [NKT88, KNT89, PCL93a, PCL93b] • E.g. memory-adaptive sort and hash-join • Doesn’t address whole-query optimization problems • However, if used with AQP, can result in complex feedback loops • Especially if their actions affect each other’s models!**Extended Topics in Adaptive QP**• An incomplete list!! • Parallelism & Distribution • River [A-D03] • FLuX [SHCF03, SHB04] • Distributed eddies [TD03] • Data Streams • Adaptive load shedding • Shared query processing**Adaptive Selection Ordering**Title slide**Selection Ordering**• Complex predicates on relations common • Eg., on an employee relation: ((salary > 120000) AND (status = 2)) OR ((salary between 90000 and 120000) AND (age < 30) AND (status = 1)) OR … • Selection ordering problem Decide the order in which to evaluate the individual predicates against the tuples • We focus on evaluating conjunctive predicates (containing only AND’s) Example Query select * from R where R.a = 10 and R.b < 20 and R.c like ‘%name%’;**Why Study Selection Ordering**• Many join queries reduce to this problem • Queries posed against a star schema • Queries where only pipelined left-deep plans are considered • Queries involving web indexes • Increasing interest in recent years • Web indexes [CDY’95, EHJKMW’96, GW’00] • Web services [SMWM’06] • Data streams [AH’00, BMMNW’04] • Sensor Networks [DGMH’05] • Similar to many problems in other domains • Sequential testing (e.g. for fault detection) [SF’01, K’01] • Learning with attribute costs [KKM’05]**Why Study Selection Ordering**• Simpler to understand and analyze • Many fundamental AQP ideas can be demonstrated with these queries • Very good analytical and theoretical results known • No analogues for general multi-way joins • Big differences to look out for • These queries are stateless; queries involving joins arenot stateless • No burden of routing history • Selections are typically very inexpensive • The costs of AQP techniques become important**Execution Strategies**Pipelined execution (tuple-at-a-time) result R • For each tuple r Є R • Apply predicate R.a = 10 first; • If tuple satisfies the selection, apply R.b < 20; • If both satisfied, apply R.c like ‘%name%’; Operator-at-a-time execution R.c like … R.a = 10 R.b < 20 R1 R2 Materialize R1 Materialize R2 R.c like … R.a = 10 R.b < 20 R result • Apply predicate R.a = 10 to all tuples of R; materialize result as R1, • Apply predicate R.b < 20 to all tuples of R1; materialize result as R2, • …**Execution Strategies**Pipelined execution (tuple-at-a-time) result R Operator-at-a-time execution R.c like … R.a = 10 R.b < 20 R1 R2 Materialize R1 Materialize R2 R.c like … R.a = 10 R.b < 20 R result Preferred for selection ordering Fundamentally different from adaptivity perspective**Outline**• 20th Century Adaptivity: Intuition from the Classical Systems • Adaptive Selection Ordering • Setting and motivation • Four Approaches • Static Selinger-style optimization • KBZ Algorithm for independent selections [KBZ’86] • A 4-approx greedy algorithm for correlated selections [BMMNW’04] • Mid-query reoptimization [KD’98] • Adapted to handle selection ordering • A-Greedy [BMMNW’04] • Eddies [AH’00] • Other related work • Adaptive Join Processing • Research Roundup**Static Selinger-style Optimization**R.c like … R.a = 10 R.a = 10 R.b < 20 R.b < 20 R.c like … R R result result • Find a single order of the selections to be used for all tuples Query select * from R where R.a = 10 and R.b < 20 and R.c like ‘%name%’; Query plans considered 3! = 6 distinct plans possible**Static Selinger-style Optimization**R.c like … R.a = 10 R.b < 20 R result Independence assumption • Cost metric: CPU instructions • Computing the cost of a plan • Need to know the costs and the selectivities of the predicates R1 R2 R3 • costs c1 c2 c3 • selectivities s1 s2 s3 • cost per c1 + s1 c2 + s1 s2 c3 • tuple cost(plan) = |R| * (c1 + s1 * c2 + s1 * s2 * c3)**Static Selinger-style Optimization**1-subsets of predicates R.a = 10 R.b < 20 R.c like … 2-subsets of predicates R.a = 10 AND R.c like .. R.a = 10 AND R.b < 20 3-subsets of predicates R.a = 10 AND R.b < 20 AND R.c like … • Dynamic programming algorithm • Complexity: O(2n) Using 1-d histograms or random samples etc Compute optimal order and cost for Using 2-d histograms or random samples, or by assuming independence**Static Selinger-style Optimization**• KBZ algorithm for independent selections [KBZ’86] • Apply the predicates in the decreasing order of: (1 – s) / c where s = selectivity, c = cost • Correlated selections • NP-hard under several different formulations • E.g. when given a random sample of the relation • Greedy algorithm: • Apply the selection with the highest (1 - s)/c • Compute the selectivities of remaining selections over the result • Conditional selectivities • Repeat • Can be shown to be 4-approximate [BMMNW’04] • Best possible unless P = NP**Static Selinger-Style**Measure/Model Actuate Plan**Outline**• 20th Century Adaptivity: Intuition from the Classical Systems • Adaptive Selection Ordering • Setting and motivation • Four Approaches • Static Selinger-style optimization • KBZ Algorithm for independent selections [KBZ’86] • A 4-approx greedy algorithm for correlated selections [BMMNW’04] • Mid-query reoptimization [KD’98] • Adapted to handle selection ordering • A-Greedy [BMMNW’04] • Eddies [AH’00] • Other related work • Adaptive Join Processing • Research Roundup**Mid-query Reoptimization**R1 R2 R3 Materialize R1 R.c like … R.a = 10 R.b < 20 R result A free opportunity to re-evaluate the rest of the query plan - Exploit by gathering information about the materialized result • At materialization points, re-evaluate the rest of the query plan • Example: Initial query plan chosen Estimated selectivities 0.05 0.1 0.2**Mid-query Reoptimization**R1 R2 Materialize R1; build 1-d hists R3 R.c like … R.a = 10 R.b < 20 R result A free opportunity to re-evaluate the rest of the query plan - Exploit by gathering information about the materialized result • At materialization points, re-evaluate the rest of the query plan • Example: Initial query plan chosen Estimated selectivities 0.05 0.1 0.2**Mid-query Reoptimization**R2 R3 R.c like … R.b < 20 Significantly different original plan probably sub-optimal Reoptimize the remainingpart of the query • At materialization points, re-evaluate the rest of the query plan • Example: Initial query plan chosen R1 Materialize R1; build 1-d hists Materialize R1 R.a = 10 R Estimated selectivities 0.05 0.1 0.2 Re-estimated selectivities 0.5 0.01**Mid-query Reoptimization**• Explored plan space identical to static • The operators are applied to the tuples in the same order • The order is determined lazily • The specific approach equivalent to the 4-Approx Greedy algorithm • Cost of adaptivity: • Materialization cost • Many (join) query plans typically have materialization points • May want to introduce materialization points if there is high uncertainty • Constructing statistics on intermediate results • Depends on the statistics maintained • Re-optimization cost • Optimizer should be re-invoked only if the estimates are significantly wrong**Mid-query Reoptimization**• Advantages: • Easy to implement in a traditional query processing system • Familiar plan space; easy to optimize and understand what's going on • Operator-at-a-time query processing • Disadvantages: • Granularity of adaptivity is coarse • Once an operator starts executing, can’t change that decision • Explored plan space identical to static optimization • Can’t apply different orders to different sets of tuples • Requires materialization • Cost of materialization can be high • Ill-suited for data streams and similar environments**Mid-query Adaptivity**Measure/Model Actuate Plan**Outline**• 20th Century Adaptivity: Intuition from the Classical Systems • Adaptive Selection Ordering • Setting and motivation • Four Approaches • Static Selinger-style optimization • KBZ Algorithm for independent selections [KBZ’86] • A 4-approx greedy algorithm for correlated selections [BMMNW’04] • Mid-query reoptimization [KD’98] • Adapted to handle selection ordering • A-Greedy [BMMNW’04] • Eddies [AH’00] • Other related work • Adaptive Join Processing • Research Roundup**Adaptive Greedy [BMMNW’04]**R1 R2 R3 R.c like … R.c like … R.a = 10 R.a = 10 R.b < 20 R.b < 20 R result Costs 1 unit 1 unit 1 unit Initial estimated selectivities 0.05 0.1 0.2 • Context: Pipelined query plans over streaming data • Example: Three independent predicates Optimal execution plan orders by selectivities (because costs are identical)**Adaptive Greedy [BMMNW’04]**R1 R2 R3 R.c like … R.a = 10 R.b < 20 R result Profile • Monitor the selectivities • Switch order if the predicates not ordered by selectivities Randomly sample R.a = 10 estimate selectivities of the predicates over the tuples of the profile R.b < 20 Rsample R.c like … Reoptimizer IF the current plan not optimal w.r.t. these new selectivities THEN reoptimize using the Profile**Adaptive Greedy [BMMNW’04]**R1 Randomly sample R2 R3 R.c like … R.a = 10 R.b < 20 R result R.a = 10 monitor selectivities sel(R.a = 10), sel(R.b < 20), sel(R.c …) R.b < 20 Rsample (Profile) R.c like … monitor conditional selectivities sel(R.b < 20 | R.a = 10) sel(R.c like … | R.a = 10) sel(R.c like … | R.a = 10 and R.b < 20) • Correlated Selections • Must monitor conditional selectivities Reoptimizer Uses conditional selectivities to detect violations Uses the profile to reoptimize O(n2) selectivities need to be monitored**Adaptive Greedy [BMMNW’04]**• Cost of adaptivity: • Profile maintenance • Must evaluate a (random) fraction of tuples against all operators • Detecting violations • Periodic checks for detecting if the current order is optimal • Doing this per tuple too expensive • Reoptimization cost • Can require multiple passes over the profile**Adaptive Greedy: Post-Mortem**Plan switch point R.c like … R.c like … R.a = 10 R.b < 20 R.b < 20 R.a= 10 order of arrival • Plan Space explored • “Horizontal partitioning” by order of arrival . . If the selectivities correlated with tuple arrival order, this can lead to huge savings**Adaptive Greedy [BMMNW’04]**• Advantages: • Can adapt very rapidly • Theoretical guarantees on performance • Not known for any other AQP protocols • Disadvantages: • Limited applicability • Only applies to selection ordering and specific types of join queries • Possibly high runtime overheads • Several heuristics described in the paper**A-Greedy Adaptivity**Measure/Model Actuate Plan**Outline**• 20th Century Adaptivity: Intuition from the Classical Systems • Adaptive Selection Ordering • Setting and motivation • Four Approaches • Static Selinger-style optimization • KBZ Algorithm for independent selections [KBZ’86] • A 4-approx greedy algorithm for correlated selections [BMMNW’04] • Mid-query reoptimization [KD’98] • Adapted to handle selection ordering • A-Greedy [BMMNW’04] • Eddies [AH’00] • Other related work • Adaptive Join Processing • Research Roundup