- 56 Views
- Uploaded on
- Presentation posted in: General

Robust Query Processing through Progressive Optimization

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Robust Query Processing through Progressive Optimization

Volker Markl, Vijayshankar Raman, David Simmen, Guy Lohman,

Hamid Pirahesh, Miso Cilimdzic

SIGMOD 2004

Modified by S. Sudarshan from talk by Raja Agrawal

- Current optimizers depend heavily upon the cardinality estimations
- What if there errors in those estimations?
- Errors can occur due to …
- Inaccurate statistics
- Invalid assumptions (e.g. attribute independence)

- Idea: lazily trigger reoptimization during execution if cardinality counts indicate current plan is suboptimal
- introduces checkpoint (CHECK) operator to compare actual vs estimated cardinality
- key idea: precompute cardinality ranges for which plan is optimal

- Risk Vs Opportunity
- Risk:
- Extent to which re-optimization is not worthwhile
- reoptimization chooses another bad plan
- work redone
- cardinality errors may even cancel, and fixing one may give an even worse plan!

- Extent to which re-optimization is not worthwhile
- Opportunity:
- Refers to the aggressiveness
- more CHECK operators..

- Refers to the aggressiveness

Risk

- Redbrick
- Star schema with fact table and multiple dimension tables
- First apply selections on dimension tables
- Then decide what plan to use

- Kabra & DeWitt 98 (KD98)
- Introduced idea of mid-query reoptimization
- Allow partial results to be use like materialized views
- But ad-hoc cardinality threshold, and only reoptimize fully materialized plans

Opportunity

Risk

- Tukwila data integration system
- optimizer may have no idea of statistics
- interleave optimization and query execution
- partial query plans
- Fragment: fully pipelined tree with doubly pipelined hash join

- Query Scrambling
- reorder query to deal with delayed sources

Opportunity

Risk

- Eddies (Telegraph)
- Ingres/DEC Rdb: run multiple access methods competitively then choose
- Parametric Query Optimization (PQO)
- e.g. Cole and Graefe 94, Hulgeri and Sudarshan 02
- Choose from a set of plans, each optimal for selectivity range
- POP: converse: find optimal cardinality range for a give plan

Opportunity

- CHECK operator to find if a plan is suboptimal
- At optimization time, find out cardinality range (at CHECK location) for which plan is optimal
- At run time, ensure cardinality within [l,u]
- If violated, stop plan execution and reoptimize

- Location of CHECKs
- Re-optimize
- taking observed cardinality into account, and
- exploiting intermediate results where beneficial
- Heuristic: limit number of reoptimizations (default: 3)

- Consider a plan edge e that flows rows into operator o,
- let P be the subplan rooted at o.
- The validity rangefor e is an upper and lower bound on the number of rows flowing through e, such that if the range is violated at runtime, we can guarantee P is suboptimal

- Ad-hoc thresholds (proposed earlier) are a bad idea
- E.g. even a 100x error on very small relation may not make a difference in optimal plan

- Plan Popt with root operator oopt is being compared with another plan Palt different only in the root operator oalt.

- Need to solve
- cost(Palt , c) – cost(Popt , c) = 0
- where c is the cardinality on edge e

- Cost functions can be complex/non-linear/non-continuous

- Detects suboptimality of the root operator where Popt and Palt share the same input edges.
- Validity range might miss a cross-over point with a plan that uses a different join order (and hence has different input edges).
- Two plans are structurally equivalent if they share the same set of edges
- where an edge is defined by the set of rows flowing through it during query execution.
- Allows different algorithms, and flipping inner/outer

- Theorem: …. Suppose edges edges ei1 , ei2 , … , eik are seen to be “erroneous” wrt cardinality. Then the following statements are equivalent:
- P is suboptimal with respect to another plan P' that has the same set of edges {e1 , e2 , … , em}
- At least one of Pi1 , Pi2 , … , Pik is suboptimal given the cardinality errors in those edges in {e1 , e2 , … , em } that lie under them.
- At least one of oi1 , oi2 , … , oik is a suboptimal operator given the cardinality errors in {e1 , e2 , … , em} that are in its input edges.

- Suppose we “detect” suboptimality of (R Join S) Join T wrt estimated costs of (R Join T) Join S
- During run time, we can never observe the cardinality of R Join T
- We would be making an arbitrary guess as to the correlation of the predicates on the R and T tables
- Best not to infer suboptimality wrt such estimates

- However, reoptimization may result in a different join order

- All the intermediate results are stored as temporary MVs
- with cardinalities available to the optimizer
- can be reused if it leads to a better plan
- but not necessarily used, e.g. if join result is very large, and a different join order is preferred

- must be reused if it has performed side-effects

- Reoptimization done as part of same transaction

- Variants applicable in different cases, trade off risk for opportunity
- Variants
- Lazy checking
- Lazy checking with eager materialization
- Eager checking without compensation
- Eager checking with buffering
- Eager checking with deferred compensation

- Adding CHECKs above a materialization point (SORT, TEMP etc)
- No results have been output yet
- And materialized results can be re-used
- very low overhead

- Can insert materialization point if it does not exists already
- Risk: overhead of materialization
- Typically done only for outer input of indexed nested-loop join
- low cost if outer is small (as estimated by optimizer)
- and INL is in trouble anyway if outer is large

- Lazy checking may be too late
- e.g. if very bad join order chosen, with huge intermediate results

- Idea: check even before entire result is materialized, and stop early
- Problem: what if some results have already been output?
- Compensation

- EC without Compensation:
- CHECK is pushed down the materialization point, into pipeline

- EC with buffering
- CHECK and buffer
- output from buffer once sure about bound
- e.g. [0,b), or [b,infinity]

- else reoptimize
- “delayed pipelining”

- Only SPJ queries
- Identifier of all rows returned to the user are stored in a table S, which is used later in the new plan for anti-join with the new-result stream

- LCEM and ECB – outer side of nested-loop join
- LC – above materialization points
- ECWC and ECDC – anywhere
- Do not place CHECKs if
- no alternative plan above CHECK
- simple queries with low estimated cost

- TPC-H Q10: Replace constant in selection on lineitem by parameter marker, so optimizer doesn’t know actual selectivity
- 5 different optimal plans

- Analyze LC, LCEM, ECB
- Can be reoptimized more than once
- Conclusion: low overhead/risk

- Goal: how often does opportunity to reoptimize arise?
- Introduce LC/LCEM/ECB checkpoints
- But turn off reoptimization, and run same plan
- Opportunity region for ECB: dotted line

- Real world workload (DMV data and queries)
- Complex predicates leading to cardinality estimation errors
- substring comparison, like, IN, ..

- Re-optimization may result in the choice of worse plan due to:
- Two estimation errors canceling out each other
- Re-using intermediate results

- POP gives us a robust mechanism for re-optimization through inserting of CHECK (in its various flavors)
- Higher opportunity at low risk