1 / 81

Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 - DB Applications

Learn about the history and background of query optimization, relational algebra equivalences, and plan cost estimation in database applications.

mcoriell
Download Presentation

Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 - DB Applications

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Carnegie Mellon Univ.Dept. of Computer Science15-415/615 - DB Applications C. Faloutsos – A. Pavlo Lecture#15: Query Optimization

  2. Today’s Class • History & Background • Relational Algebra Equivalences • Plan Cost Estimation • Plan Enumeration CMU SCS 15-415/615

  3. Query Optimization • Remember that SQL is declarative. • User tells the DBMS what answer they want, not how to get the answer. • There can be a big difference in performance based on plan is used: • See last week: 5.7 days vs. 45 seconds CMU SCS 15-415/615

  4. 1970s – Relational Model • Ted Codd saw the maintenanceoverhead for IMS/Codasyl. • Proposed database abstraction basedon relations: • Store database in simple data structures. • Access it through high-level language. • Physical storage left up to implementation. Codd CMU SCS 15-415/615

  5. IBM System R • Skunkworks project at IBM Research in San Jose to implement Codd’s ideas. • Had to figure out all of the things that we are discussing in this course themselves. • IBM never commercialized System R. CMU SCS 15-415/615

  6. IBM System R • First implementation of a query optimizer. • People argued that the DBMS could never choose a query plan better than what a human could write. • A lot of the concepts from System R’s optimizer are still used today. CMU SCS 15-415/615

  7. Today’s Class • History & Background • Relational Algebra Equivalences • Plan Cost Estimation • Plan Enumeration • Nested Sub-queries CMU SCS 15-415/615

  8. Relational Algebra Equivalences • A query can be expressed in different ways. • The optimizer considers variations and choose the one with the lowest cost. • How do we know whether two queries are equivalent? CMU SCS 15-415/615

  9. Relational Algebra Equivalences • Two relational algebra expressions are equivalentif they generate the same set of tuples. CMU SCS 15-415/615

  10. Predicate Pushdown SELECT name, cid FROM student, enrolled WHERE student.sid = enrolled.sid AND enrolled.grade = ‘A’ p p name, cid name, cid s ⨝ sid=sid grade=‘A’ ⨝ sid=sid grade=‘A’ s enrolled student enrolled student CMU SCS 15-415/615

  11. Relational Algebra Equivalences SELECT name, cid FROM student, enrolled WHERE student.sid = enrolled.sid AND enrolled.grade = ‘A’ pname, cid(sgrade=‘A’(student⋈enrolled)) = pname, cid(student⋈(sgrade=‘A’ (enrolled))) CMU SCS 15-415/615

  12. Relational Algebra Equivalences • Selections: • Perform them early • Break a complex predicate, and push downsp1∧p2∧…pn(R) = sp1(sp2(s…pn(R))…) • Simplify a complex predicate • (X=Y AND Y=3) → X=3 AND Y=3 CMU SCS 15-415/615

  13. Relational Algebra Equivalences • Projections: • Perform them early • Smaller tuples • Fewer tuples (if duplicates are eliminated) • Project out all attributes except the ones requested or required (e.g., joining attr.) • This is not important for a column store… CMU SCS 15-415/615

  14. Projection Pushdown SELECT name, cid FROM student, enrolled WHERE student.sid = enrolled.sid AND enrolled.grade = ‘A’ p name, cid p name, cid s grade=‘A’ ⨝ sid=sid ⨝ sid=sid p sid, cid grade=‘A’ p sid, name s enrolled student grade=‘A’ s enrolled student CMU SCS 15-415/615

  15. Relational Algebra Equivalences • Joins: • Commutative, associative • Q: How many different orderings are there for an n-way join? R ⋈ S = S ⋈ R (R ⋈ S) ⋈ T = R ⋈(S ⋈T) CMU SCS 15-415/615

  16. Relational Algebra Equivalences • Joins: How many different orderings are there for an n-way join? • A: Catalan number ~ 4n • Exhaustive enumeration: too slow. • We’ll see in a second how an optimizer limits the search space... CMU SCS 15-415/615

  17. Today’s Class • History & Background • Relational Algebra Equivalences • Plan Cost Estimation • Plan Enumeration CMU SCS 15-415/615

  18. Cost Estimation • How long will a query take? • CPU: Small cost; tough to estimate • Disk: # of block transfers • Memory: Amount of DRAM used • Network: # of messages • How many tuples will be read/written? • What statistics do we need to keep? CMU SCS 15-415/615

  19. SR #1 #2 #3 … #NR Cost Estimation – Statistics • For each relation R we keep: • NR→ # tuples • SR → size of tuple in bytes • V(A,R) → # of distinct valuesof attribute ‘A’ CMU SCS 15-415/615

  20. #1 #2 #3 … #BR Derivable Statistics SR • FR→ max# records/block • BR→ # blocks • SC(A,R)→ selection cardinality avg# of records with A=given FR CMU SCS 15-415/615

  21. Derivable Statistics • SC(A,R) → Selection Cardinalityavg# of records with A=given→ NR / V(A,R) • Note that this assumes data uniformity • 10,000 students, 10 colleges – how many students in SCS? CMU SCS 15-415/615

  22. Additional Statistics • For index i: • Fi → average fanout (~50-100) • HTi → # levels of index i (~2-3)~ log(#entries)/log(Fi) • LBi # → blocks at leaf level HTi CMU SCS 15-415/615

  23. Statistics • Where do we store them? • How often do we update them? • Manual invocations: • Postgres/SQLite: ANALYZE • MySQL: ANALYZE TABLE CMU SCS 15-415/615

  24. Selection Statistics • We saw simple predicates (name=“Kayne”) • How about more complex predicates, like • salary > 10000 • age=30 AND jobTitle=“Costermonger” • What is their selectivity? CMU SCS 15-415/615

  25. Selections – Complex Predicates • Selectivity sel(P) of predicate P:== fraction of tuples that qualify • Formula depends on type of predicate. • Equality • Range • Negation • Conjunction • Disjunction # of tuples sel(P) = SC(P) / NR Selection Cardinality CMU SCS 15-415/615

  26. Selections – Complex Predicates • Assume that V(rating, sailors) has 5 distinct values (0–4) and NR = 5 • Equality Predicate: A=constant • sel(A=constant) = SC(P)/ V(A,R) • Example: sel(rating=‘2’)= 1/5 SC(rating=‘2’)=1 V(rating,R)=5

  27. Selections – Complex Predicates • Range Query: • sel(A>a) = (Amax – a) / (Amax – Amin) • Example: sel(rating >= ‘2’) = (4 – 2) / (4 – 0) = 1/2 ratingmin = 0 ratingmax= 4

  28. Selections – Complex Predicates • Negation Query • sel(not P) = 1 – sel(P) • Example: sel(rating != ‘2’) • Observation: selectivity ≈ probability = 1 – (1/5) = 4/5 SC(rating!=‘2’)=2 SC(rating=‘2’)=1 SC(rating!=‘2’)=2

  29. Selections – Complex Predicates • Conjunction: • sel(rating = ‘2’ AND name LIKE ‘C%’) • sel(P1 ⋀ P2) = sel(P1) ∙ sel(P2) • INDEPENDENCE ASSUMPTION Not always true in practice! P1 P2 CMU SCS 15-415/615

  30. Selections – Complex Predicates • Disjunction: • sel(rating = ‘2’ OR name LIKE ‘C%’) • sel(P1 ⋁ P2) = sel(P1) + sel(P2) – sel(P1 ⋁ P2)= sel(P1) + sel(P2) – sel(P1) ∙ sel(P2) • INDEPENDENCE ASSUMPTION, again CMU SCS 15-415/615

  31. Selections – Complex Predicates • Disjunction, in general: • sel(P1OR P2OR … Pn) = • 1 - (1- sel(P1) ) ∙ (1 - sel(P2) ) ∙ … (1 - sel(Pn)) P1 P2 CMU SCS 15-415/615

  32. Joins • Q: Given a join of R and S, what is the range of possible result sizes in #of tuples? CMU SCS 15-415/615

  33. Result Size Estimation for Joins • General case: Rcols⋂ Scols= {A} where A is not a key for either table. • Hint: for a given tuple of R, how many tuples of S will it match? CMU SCS 15-415/615

  34. Result Size Estimation for Joins • General case: Rcols⋂ Scols= {A} where A is not a key for either table. • Match each R-tuple with S-tuples:estSize ≈NR∙ NS / V(A,S) • Symmetrically, for S:estSize ≈NR∙ NS / V(A,R) • Overall: • estSize ≈NR∙ NS / max( {V(A,S), V(A,R)} ) CMU SCS 15-415/615

  35. Cost Estimations • Our formulas are nice but we assume that data values are uniformly distributed. # of occurrences Distinct values of attribute

  36. Cost Estimations • Our formulas are nice but we assume that data values are uniformly distributed. Bucket #1 Count=8 Bucket #2 Count=4 Bucket #3 Count=15 Bucket #4 Count=3 Bucket #5 Count=14

  37. Histograms with Quantiles • A histogram type wherein the “spread” of each bucket is same. Bucket #1 Count=12 Bucket #2 Count=12 Bucket #3 Count=9 Bucket #4 Count=12

  38. Sampling • Modern DBMSs also employ sampling to estimate predicate selectivities. SELECT * FROM sailorsWHERE rating > 100 sel(rating>100) = ⋮ 1 billion tuples = 1/3 CMU SCS 15-415/615

  39. Today’s Class • History & Background • Relational Algebra Equivalences • Plan Cost Estimation • Plan Enumeration CMU SCS 15-415/615

  40. Query Optimization • Bring query in internal form into “canonical form” (syntactic q-opt) • Generate alternative plans. • Single relation. • Multiple relations. • Nested sub-queries. • Estimate cost for each plan. • Pick the best one. CMU SCS 15-415/615

  41. #1 #2 #3 #BR Plan Generation SR SELECT * FROM sailors WHERE rating = 2 • What are our plan options? FR … CMU SCS 15-415/615

  42. #1 #2 #3 … #BR Plan Generation SR SELECT * FROM sailors WHERE rating = 2 • Sequential Scan • Binary Search • if sorted & consecutive • Index Search • if an index exists FR CMU SCS 15-415/615

  43. #1 #2 #3 … #BR Sequential Scan SR SELECT * FROM sailors WHERE rating = 2 • BR (worst case) • BR /2 (on average, if we searchfor primary key) FR CMU SCS 15-415/615

  44. #1 #2 #3 … #BR Binary Search SR SELECT * FROM sailors WHERE rating = 2 • ~log(BR) + SC(A,R)/ FR • Extra blocks are ones thatcontain qualifying tuples FR CMU SCS 15-415/615

  45. #1 #2 #3 … #BR Binary Search SR SELECT * FROM sailors WHERE rating = 2 We showed that estimating this is non-trivial. • ~log(BR) + SC(A,R)/ FR • Extra blocks are ones thatcontain qualifying tuples FR CMU SCS 15-415/615

  46. #1 #2 #3 … #BR Index Search SR SELECT * FROM sailors WHERE rating = 2 • Index Search: • levels of index + blocks w/ qual. tuples FR Case#1: Primary Key Case#2: Secondary key – clustering index Case#3: Secondary key – non-clust. index CMU SCS 15-415/615

  47. #1 #2 #3 … #BR HTi Index Search: Case #1 SR SELECT * FROM sailors WHERE rating = 2 • Primary Key • cost: HTi + 1 CMU SCS 15-415/615

  48. #1 #2 #3 … #BR HTi Index Search: Case #2 SR SELECT * FROM sailors WHERE rating = 2 • Secondary key withclustering index: • cost: HTi + SC(A,R)/FR CMU SCS 15-415/615

  49. #1 #2 #3 … #BR HTi Index Search: Case #3 SR SELECT * FROM sailors WHERE rating = 2 • Secondary key withnon-clustering index: • cost: HTi + SC(A,R) … CMU SCS 15-415/615

  50. Query Optimization • Bring query in internal form into “canonical form” (syntactic q-opt) • Generate alternative plans. • Single relation. • Multiple relations. • Nested sub-queries. • Estimate cost for each plan. • Pick the best one. CMU SCS 15-415/615

More Related