
MiniCon Reformulation & Adaptive Re-Optimization

Presentation Transcript


  1. MiniCon Reformulation & Adaptive Re-Optimization Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems February 23, 2005

  2. Administrivia Next reading assignment: • Urhan & Franklin – Query Scrambling • Ives et al. – Adaptive Data Partitioning • Compare the different approaches • One-page proposal of your project scope, goals, and means of assessing success/failure due next Monday, Feb. 28th

  3. Today’s Trivia Question

  4. Buckets, Rev. 2: The MiniCon Algorithm • A “much smarter” bucket algorithm: • In many cases, we don’t need to perform the cross-product of all items in all buckets • Eliminates the need for the containment check • This – and the Chase & Backchase strategy of Tannen et al – are the two methods most used in virtual data integration today

  5. Minicon Descriptions (MCDs) • Basically, a modification to the bucket approach • “head homomorphism” – defines what variables must be equated • Variable-substituted version of the subgoals • Mapping of variable names • Info about what’s covered • Property 1: • If a variable occurs in the head of a query, then there must be a corresponding variable in the head of the MCD view • If a variable participates in a join predicate in the query, then it must be in the head of the view

  6. MCD Construction For each subgoal of the query: • For each subgoal of each view: • Choose the least restrictive head homomorphism to match the subgoal of the query • If we can find a way of mapping the variables, then add an MCD for each possible “maximal” extension of the mapping that satisfies Property 1
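To make the construction loop concrete, here is a minimal Python sketch of MCD formation. It is an illustration, not the paper's pseudocode: comparison predicates and constants are ignored, only single-subgoal MCDs are formed, and Property 1 is enforced in its simplest reading (query head variables and join variables must map to exported view variables), omitting the "maximal extension" step that covers extra subgoals when a join variable is not exported. The names (Atom, View, MCD, form_mcds) are made up for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Atom:
    pred: str
    args: tuple                  # variable names (constants ignored in this sketch)

@dataclass
class View:
    name: str
    head: tuple                  # exported (distinguished) variables
    body: list                   # list of Atom

@dataclass
class MCD:
    view: View
    mapping: dict                # query variable -> view variable
    covered: frozenset           # indices of covered query subgoals

def try_map(q_atom, v_atom, mapping):
    """Extend `mapping` so that q_atom maps onto v_atom, or return None."""
    if q_atom.pred != v_atom.pred or len(q_atom.args) != len(v_atom.args):
        return None
    m = dict(mapping)
    for qv, vv in zip(q_atom.args, v_atom.args):
        if qv in m and m[qv] != vv:
            return None          # inconsistent with an earlier binding
        m[qv] = vv
    return m

def form_mcds(query_head, query_body, views):
    # Simplified Property 1: query head variables and variables joining two
    # query subgoals must land on exported (head) variables of the view.
    join_vars = {v for a in query_body for v in a.args
                 if sum(v in b.args for b in query_body) > 1}
    must_export = set(query_head) | join_vars
    mcds = []
    for view in views:
        for gi, q_atom in enumerate(query_body):
            for v_atom in view.body:
                m = try_map(q_atom, v_atom, {})
                if m is not None and all(m[x] in view.head
                                         for x in must_export if x in m):
                    mcds.append(MCD(view, m, frozenset([gi])))
    return mcds
```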

  7. MCDs for Our Example
  q(t) :- show(i, t, y, g), rating(i, r, s), r = 5
  5star(i) :- show(i, t, y, g), rating(i, 5, s)
  TVguide(t,y,g,r) :- show(i, t, y, g), rating(i, r, “TVGuide”)
  movieInfo(i,t,y,g) :- show(i, t, y, g)
  critics(i,r,s) :- rating(i, r, s)
  goodMovies(t,y) :- show(i, t, 1997, “drama”), rating(i, 5, s)
  good98(t,y) :- show(i, t, 1998, “drama”), rating(i, 5, s)

  8. Combining MCDs • Now look for ways of combining pairwise disjoint subsets of the goals • Greatly reduces the number of candidates! • Also proven to be correct without the use of a containment check • Variations need to be made for: • Constants in general (I sneaked those in) • “Semi-interval” predicates (x <= c) • Note that full-blown inequality predicates are co-NP-hard in the size of the data, so they don’t work
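A small sketch of the combination step, reusing the MCD objects from the sketch above: enumerate sets of MCDs whose covered-subgoal sets are pairwise disjoint and together cover every query subgoal; each such set corresponds to one conjunctive rewriting. The brute-force enumeration is only for illustration.

```python
from itertools import combinations

def combine_mcds(mcds, n_query_subgoals):
    """Return every set of MCDs that partitions the query's subgoals."""
    all_goals = frozenset(range(n_query_subgoals))
    rewritings = []
    for k in range(1, len(mcds) + 1):
        for subset in combinations(mcds, k):
            covered = [m.covered for m in subset]
            union = frozenset().union(*covered)
            # Disjoint (sizes add up) and complete (all query subgoals covered):
            if union == all_goals and sum(map(len, covered)) == len(union):
                rewritings.append(list(subset))
    return rewritings
```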

  9. MiniCon Performance, Many Rewritings

  10. Larger Query, Fewer Rewritings

  11. MiniCon and LAV Summary • The state of the art for answering queries using views (AQUV) in the relational world of data integration • It’s been extended to support “conjunctive XQuery” as well • Scales to large numbers of views, which we need in LAV data integration • A similar approach: Chase & Backchase by Tannen et al. • Slightly more general in some ways – but: • Produces equivalent rewritings, not maximally contained ones • Not always polynomial in the size of the data

  12. Motivations for Adaptive Query Processing Many domains where cost-based query optimization fails: Complex queries in traditional databases: estimation error grows exponentially with # joins [IC91] Querying over the Internet: unpredictable access rates, delays Querying external data sources: limited information available about properties of this source Monitor real-world conditions, adapt processing strategy in response

  13. Can We Get RDBMS-Level Optimization for Data Integration, without Statistics? Multiple remote sources • Described and mapped “loosely” • Data changes frequently • Generally, would like to support the same kinds of queries as in a local setting [Figure: queries against the mediated schema are answered by the data integration system, which uses a source catalog and schema mappings to reach remote, autonomous data sources and return results]

  14. What Are the Sources of Inefficiency? • Delays – we “stall” in waiting for I/O • We’ll talk about this on Monday • Bad estimation of intermediate result sizes • The focus of the Kabra and DeWitt paper • No info about source cardinalities • The focus of the eddies paper – and Monday’s paper • The latter two are closely related • Major challenges: • Trading off information acquisition (exploration) vs. use (exploitation) • Extrapolating performance based on what you’ve seen so far

  15. Kabra and DeWitt • Goal: “minimal update” to a traditional optimizer in order to compensate for bad decisions • General approach: • Break the query plan into stages • Instrument it • Allow for re-invocation of optimizer if it’s going awry

  16. Elements of Mid-Query Re-Optimization • Annotated Query Execution Plans • Annotate plan with estimates of size • Runtime Collection of Statistics • Statistics collectors embedded in execution tree • Keep overhead down • Dynamic Resource Re-allocation • Reallocate memory to individual operations • Query Plan Modification • May wish to re-optimize the remainder of query

  17. Annotated Query Plans • We save at each point in the tree the expected: • Sizes and cardinalities • Selectivities of predicates • Estimates of number of groups to be aggregated
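As a rough illustration (field names are invented, not Paradise's), an annotated plan node simply carries the optimizer's estimates next to the operator, so that runtime observations can later be compared against them:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PlanNode:
    operator: str                               # e.g. "scan", "hash-join", "agg"
    children: List["PlanNode"] = field(default_factory=list)
    est_cardinality: Optional[int] = None       # expected number of output tuples
    est_size_bytes: Optional[int] = None        # expected output size
    est_selectivity: Optional[float] = None     # for predicates
    est_num_groups: Optional[int] = None        # for aggregation
    obs_cardinality: Optional[int] = None       # filled in by a statistics collector
```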

  18. Statistics Collectors • Add into tree • Must be collectable in a single pass • Will only help with portions of query “beyond” the current pipeline
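A minimal sketch of such a collector, assuming a pull-based iterator model: it passes tuples through unchanged, counts them in a single pass, and records the observed cardinality only once its input is drained, which is why it helps only operators beyond the current pipeline. The function name and the `stats` dictionary are illustrative.

```python
def stat_collector(child_iter, stats, key):
    """Pass tuples through while counting them in one pass."""
    count = 0
    for tup in child_iter:
        count += 1               # cheap per-tuple work keeps overhead low
        yield tup                # the tuple flows on unchanged
    stats[key] = count           # observed cardinality, known only after the pass

# Hypothetical use:
# stats = {}
# for row in stat_collector(scan_output, stats, "join1_input"):
#     ...
```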

  19. Resource Re-Allocation • Based on improved estimates, we can modify the memory allocated to each operation • Results: less I/O, better performance • Only for operations that have not yet begun executing, i.e., not in the pipeline
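One very simple policy, sketched purely as an illustration (the paper's actual allocation scheme may differ): give each not-yet-started operator a share of memory proportional to its revised input-size estimate.

```python
def reallocate_memory(pending_ops, total_memory_bytes):
    """pending_ops: list of (op_name, revised_input_bytes) for operators that
    have not started executing; returns a memory grant per operator."""
    total_need = sum(size for _, size in pending_ops)
    grants = {}
    for name, size in pending_ops:
        share = size / total_need if total_need else 1 / len(pending_ops)
        grants[name] = int(total_memory_bytes * share)
    return grants
```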

  20. Plan Modification • Only re-optimize part not begun • Suspend query, save intermediate in temp file • Create new plan for remainder, treating temp as an input

  21. Re-Optimization When to re-optimize: • Calculate the time the current plan should take (using gathered stats) • Only consider re-optimization if: • our original estimate was off by at least some factor f2, and if • T_opt,estimated < f1 · T_cur-plan,improved, where f1 ≈ 5% and the cost of optimization depends on the number of operators, esp. joins • Only modify the plan if the new estimate, including the cost of writing the temp file, is better
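A hedged sketch of that trigger. The constants stand in for the slide's parameters f2 and f1 and are illustrative; the final decision to actually switch plans (which also charges for writing the temp file) is not shown.

```python
EST_ERROR_FACTOR = 2.0     # stand-in for f2: estimate off by at least this factor
OPT_COST_FRACTION = 0.05   # stand-in for f1: optimizer call < 5% of remaining work

def should_consider_reoptimizing(est_remaining, improved_remaining, optimizer_cost):
    """Decide whether it is worth suspending the query and re-invoking the optimizer."""
    lo = min(est_remaining, improved_remaining)
    hi = max(est_remaining, improved_remaining)
    if hi < EST_ERROR_FACTOR * max(lo, 1e-9):
        return False                       # the original estimate was not far off
    return optimizer_cost < OPT_COST_FRACTION * improved_remaining
```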

  22. Low-Overhead Statistics Want to find the “most effective” statistics • Don’t want to gather statistics for “simple” queries • Want to limit the effect of the algorithm to a maximum overhead ratio • Factors: • Probability of inaccuracy • Fraction of query affected How do we know this without having stats?

  23. Inaccuracy Potentials The following heuristics are used: • Inaccuracy potential = low, medium, high • Lower if we have more information on table value distribution • 1+max of inputs for multiple-input selection • Always high for user-defined methods • Always high for non-equijoins • For most other operators, same as worst of inputs

  24. More Heuristics • Check fraction of query affected • Check how many other operators use the same statistic • The winner: • Higher inaccuracy potentials first • Then, if a tie, the one affecting the larger portion of the plan
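A rough sketch of these heuristics; the numeric encoding of low/medium/high and the exact adjustment for known value distributions are assumptions made for illustration.

```python
LOW, MEDIUM, HIGH = 1, 2, 3

def inaccuracy_potential(op_kind, input_potentials, has_distribution_stats):
    if op_kind in ("user_defined_method", "non_equijoin"):
        return HIGH                                   # always high
    if op_kind == "selection" and len(input_potentials) > 1:
        return min(HIGH, 1 + max(input_potentials))   # 1 + max of inputs
    base = max(input_potentials) if input_potentials else LOW
    return max(LOW, base - 1) if has_distribution_stats else base

def rank_collector_sites(candidates):
    """candidates: (name, inaccuracy_potential, fraction_of_plan_affected).
    Higher potential wins; ties broken by the fraction of the plan affected."""
    return sorted(candidates, key=lambda c: (c[1], c[2]), reverse=True)
```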

  25. Implementation • On top of Paradise (parallel database that supports ADTs, built on OO framework) • Using System-R optimizer • New SCIA (Stat Collector Insertion Algorithm) and Dynamic Re-Optimization modules

  26. It Works! • Results are 5% worse for simple queries, much better for complex queries • Of course, we would not really collect statistics on simple queries • Data skew made a slight difference - both normal and re-optimized queries performed slightly better

  27. Pros and Cons • Provides significant potential for improvement without adding much overhead • Biased towards exploitation, with very limited information-gathering • A great way to retrofit an existing system • In SIGMOD 2004, IBM had a paper that did this in DB2 • But fairly limited to the traditional DB context • Relies on us knowing the (rough) cardinalities of the sources • Query plans aren’t pipelined, meaning: • If the pipeline is broken too infrequently, mid-query re-optimization may not help • If the pipeline is broken too frequently, time-to-first-answer is slow

  28. The Opposite Extreme: Eddies • The basic idea: • Query processing consists of sending tuples through a series of operators • Why not treat it like a routing problem? • Rely on “back-pressure” (i.e., queue overflow) to tell us where to send tuples • Part of the ongoing Telegraph project at Berkeley • Large-scale federated, shared-nothing data stream engine • Variations in data transfer rates • Little knowledge of data sources

  29. Telegraph Architecture • Simple “pre-optimizer” to generate initial plan • Creates operators, e.g.: • Select: predicate(sourceA) • Join: predicate(sourceA, sourceB) • (No support for aggregation, union, etc.) • Chooses implementations, e.g., select using an index, join using a hash join • Goal: dataflow-driven scheduling of operations • Tuple comes into system • Adaptively routed through operators in “eddies” • May be combined, discarded, etc.

  30. Can’t Always Re-order Arbitrarily • Need “moment of symmetry” • Some operators have a scheduling dependency • e.g., nested loops join: for each tuple in the left table, for each tuple in the right table, if the tuples meet the predicate, output the result • Index joins, pipelined hash joins always symmetric • Sometimes have order restrictions, e.g., request a tuple from one source, ship it to another

  31. The Eddy • Represents set of possible orderings of a subplan • Each tuple may “flow” through a different ordering (which may be constrained) • N-ary module consisting of query operators • Basic unit of adaptivity • Subplan with select, project, join operators

  32. Example Join Subplan Alternatives [Figure: alternative join orderings for relations R1, R2, and R3 over their join predicates on attribute x]

  33. Naïve Eddy • Given tuple, route to operator that’s ready • Analogous to fluid dynamics • Adjusts to operator costs • Ignores operator selectivity (reduction of input)
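A one-line sketch of the naive policy, assuming each operator exposes an input queue: send the tuple to the remaining operator with the most queue space, so slow operators (with long queues) automatically receive fewer tuples, while selectivity plays no role.

```python
def naive_route(remaining_ops, queue_len):
    """Pick the 'most ready' remaining operator for the next tuple via back-pressure."""
    return min(remaining_ops, key=lambda op: queue_len[op])
```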

  34. Adapting to Variable-Cost Selection

  35. But Selectivity is Ignored

  36. Lottery-Based Eddy • Need to favor more selective operators • Ticket given per tuple input, returned per output • Lottery scheduling based on number of tickets • Now handles both selectivity and cost
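A minimal sketch of the ticket mechanism: the eddy credits an operator with a ticket for every tuple routed to it and debits one for every tuple the operator returns, then picks the next destination with a weighted lottery, so cheap, highly selective operators see tuples first. Class and method names are illustrative.

```python
import random

class LotteryEddy:
    def __init__(self, operators):
        self.tickets = {op: 1 for op in operators}    # everyone starts with one ticket

    def choose(self, remaining_ops):
        """Weighted lottery over the operators this tuple still has to visit."""
        ops = list(remaining_ops)
        weights = [self.tickets[op] for op in ops]
        return random.choices(ops, weights=weights, k=1)[0]

    def on_tuple_routed_to(self, op):
        self.tickets[op] += 1                         # ticket given per input tuple

    def on_tuple_returned_by(self, op):
        self.tickets[op] = max(1, self.tickets[op] - 1)   # ticket returned per output
```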

  37. Enhancing Adaptivity: Sliding Window • Tickets were for entire query • Weighted sliding window approach • “Escrow” tickets during a window • “Banked” tickets from a previous window
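A simple rollover sketch of the windowed scheme (the paper's actual weighting between windows may differ): tickets earned in the current window are held in escrow, and at each window boundary they replace the banked tickets that drive the lottery, so stale behavior ages out.

```python
class WindowedTickets:
    def __init__(self, operators, window_size=100):
        self.escrow = {op: 0 for op in operators}   # earned in the current window
        self.banked = {op: 1 for op in operators}   # drives the routing lottery
        self.window_size = window_size
        self.routed_in_window = 0

    def credit(self, op):
        self.escrow[op] += 1
        self.routed_in_window += 1
        if self.routed_in_window >= self.window_size:
            self.banked = dict(self.escrow)              # roll the window over
            self.escrow = {op: 0 for op in self.escrow}
            self.routed_in_window = 0

    def weight(self, op):
        return max(1, self.banked[op])              # lottery weight for routing
```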

  38. What about Delays? Problems here: • Don’t know when the join buffers vs. “discards” tuples [Figure: a join tree over sources R, S, and T, where one source is slow]

  39. Eddy Pros and Cons • Mechanism for adaptively re-routing queries • Makes optimizer’s task simpler • Can do nearly as well as well-optimized plan in some cases • Handles variable costs, variable selectivities • But doesn’t really handle joins very well – attempts to address in follow-up work: • STeMs – break a join into separate data structures; requires re-computation at each step • STAIRs – create intermediate state and shuffle it back and forth

  40. Other Areas Where Things Can Be Improved • The work of pre-optimization or re-optimization • Choose good initial/next query plan • Pick operator implementations, access methods • STeMs fix this • Handle arbitrary operators (aggregation, outer join, sorting, …) – next time we’ll see techniques to address this • Distribute work • Distributed eddies (by DeWitt and students)
