1 / 53

Temple University – CIS Dept. CIS331– Principles of Database Systems

This overview provides an introduction to query processing in database systems, covering topics such as the relational model, SQL, normalization, indexing, query optimization, and join strategies.

nathanield
Download Presentation

Temple University – CIS Dept. CIS331– Principles of Database Systems

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Temple University – CIS Dept.CIS331– Principles of Database Systems V. Megalooikonomou Query Processing (based on notes by C. Faloutsos at CMU)

  2. General Overview - rel. model • Relational model - SQL • Functional Dependencies & Normalization • Physical Design; Indexing • Query processing/optimization • Transaction processing • Advanced topics • Distributed Databases • OO- and OR-DBMSs

  3. Data-files catalog Overview of a DBMS Naïve user casual user DBA DML parser DDL parser DML precomp. trans. mgr buffer mgr

  4. Overview - detailed • Motivation - Why q-opt? • Equivalence of expressions • Cost estimation • Cost of indices • Join strategies

  5. Why Q-opt? • SQL: ~declarative • good q-opt -> big difference • e.g., seq. Scan vs B-tree index, on P=1,000 pages

  6. Q-opt steps bring query in internal form (e.g., parse tree) … into ‘canonical form’ (syntactic q-opt) generate alternative plans estimate cost; pick best

  7. Q-opt - example p s TAKES STUDENT Canonical form p select name from STUDENT, TAKES where c-id=‘CIS331’ and STUDENT.ssn=TAKES.ssn s STUDENT TAKES

  8. Q-opt - example p s TAKES STUDENT Hash join; merge join; nested loops; Index; seq. scan

  9. Overview - detailed • Why q-opt? • Equivalence of expressions • Cost estimation • Cost of indices • Join strategies

  10. Equivalence of expressions • … or syntactic q-opt • In short: perform selections and projections early • More details: see transformation rules in text

  11. Equivalence of expressions • Q: How to prove a transformation rule? • A: use TRC, to show that LHS = RHS, e.g.:

  12. Equivalence of expressions

  13. Equivalence of expressions

  14. Equivalence of expressions • Selections • perform them early • break a complex predicate, and push • simplify a complex predicate • (‘X=Y and Y=3’) -> ‘X=3 and Y=3’

  15. Equivalence of expressions • Projections • perform them early (but carefully…) • Smaller tuples • Fewer tuples (if duplicates are eliminated) • project out all attributes except the ones requested or required (e.g., joining attr.)

  16. Equivalence of expressions • Joins • Commutative , associative • Q: n-way join - how many diff. orderings? … Exhaustive enumeration too slow…

  17. Q-opt steps bring query in internal form (e.g., parse tree) … into ‘canonical form’ (syntactic q-opt) generate alt. plans estimate cost; pick best

  18. Cost estimation • E.g., find ssn’s of students with an ‘A’ in CIS331 (using seq. scanning) • How long will a query take? • CPU (but: small cost; decreasing; tough to estimate) • Disk (mainly, # block transfers) • How many tuples will qualify? • (what statistics do we need to keep?)

  19. Sr #1 #2 #3 … #nr Cost estimation • Statistics: for each relation ‘r’ we keep • nr : # tuples; • Sr : size of tuple in bytes

  20. Cost estimation Sr • Statistics: for each relation ‘r’ we keep • … • V(A,r): number of distinct values of attr. ‘A’ • (recently, histograms, too) #1 #2 #3 … #nr

  21. Sr fr #1 #2 … #br Derivable statistics • fr: blocking factor = max# records/block (=?? ) • br: # blocks (=?? ) • SC(A,r) = selection cardinality = avg# of records with A=given (=?? )

  22. Derivable statistics • fr: blocking factor = max# records/block (= B/Sr ; B: block size in bytes) • br: # blocks (= nr / fr )

  23. Derivable statistics • SC(A,r) = selection cardinality = avg# of records with A=given (= nr / V(A,r) ) (assumes uniformity...) – eg: 30,000 students, 10 colleges – how many students in CST?

  24. Additional quantities we need: • For index ‘i’: • fi: average fanout - degree (~50-100) • HTi: # levels of index ‘i’ (~2-3) • ~ log(#entries)/log(fi) • LBi: # blocks at leaf level HTi

  25. Statistics • Where do we store them? • How often do we update them?

  26. Q-opt steps bring query in internal form (e.g., parse tree) … into ‘canonical form’ (syntactic q-opt) generate alt. plans selections; sorting; projections joins estimate cost; pick best

  27. Sr fr #1 #2 … #br Cost estimation + plan generation • Selections – e.g., select * from TAKES where grade = ‘A’ • Plans?

  28. Sr fr #1 #2 … #br Cost estimation + plan generation • Plans? • seq. scan • binary search • (if sorted & consecutive) • index search • if an index exists

  29. Sr fr #1 #2 … #br Cost estimation + plan generation seq. scan – cost? • br (worst case) • br/2 (average, if we search for primary key)

  30. Sr fr #1 #2 … #br Cost estimation + plan generation binary search – cost? if sorted and consecutive: • ~log(br) + • SC(A,r)/fr (=#blocks spanned by qualified tuples) • -1

  31. Sr fr #1 #2 … #br Cost estimation + plan generation estimation of selection cardinalities SC(A,r): non-trivial

  32. Sr fr #1 #2 … #br Cost estimation + plan generation method#3: index – cost? • levels of index + • blocks w/ qual. tuples ... case#1: primary key case#2: sec. key – clustering index case#3: sec. key – non-clust. index

  33. Sr fr #1 #2 … #br HTi Cost estimation + plan generation method#3: index – cost? • levels of index + • blocks w/ qual. tuples .. case#1: primary key – cost: HTi + 1

  34. HTi Cost estimation + plan generation Sr method#3: index - cost? • levels of index + • blocks w/ qual. tuples #1 fr #2 case#2: sec. key – clustering index OR prim. index on non-key …retrieve multiple records HTi + SC(A,r)/fr … #br

  35. Sr fr #1 #2 … #br Cost estimation + plan generation method#3: index – cost? • levels of index + • blocks w/ qual. tuples ... case#3: sec. key – non-clust. index HTi + SC(A,r) (actually, pessimistic...)

  36. Cost estimation – arithmetic examples find accounts with branch-name = ‘Perryridge’ account(branch-name, balance, ...)

  37. Arithm. examples – cont’d • n-account = 10,000 tuples • f-account = 20 tuples/block • V(balance, account) = 500 distinct values • V(branch-name, account) = 50 distinct values • for branch-index: fanout fi = 20

  38. Arithm. examples • Q1: cost of seq. scan? • A1: 500 disk accesses • Q2: assume a clustering index on branch-name – cost?

  39. HTi Cost estimation + plan generation Sr method#3: index – cost? • levels of index + • blocks w/ qual. tuples #1 fr #2 case#2: sec. key – clustering index HTi + SC(A,r)/fr … #br

  40. Arithm. examples • A2: HTi + SC(branch-name, account)/f-account • HTi: 50 values, with index fanout 20 -> HT=2 levels (log(50)/log(20) = 1+) • SC(..)= # qualified records = • nr/V(A,r) = 10,000/50 = 200 tuples • SC/f: spanning 200/20 blocks = 10 blocks

  41. Arithm. examples • A2 final answer: 2+10= 12 block accesses • (vs. 500 block accesses of seq. scan) • footnote: in all fairness • seq. disk accesses: ~2msec or less • random disk accesses: ~10msec

  42. Overview - detailed • Motivation - Why q-opt? • Equivalence of expressions • Cost estimation • Cost of indices • Join strategies

  43. 2-way joins algorithm(s) for r JOIN s? nr, ns tuples each r(A, ...) s(A, ......) nr ns

  44. 2-way joins Algorithm #0: (naive) nested loop (SLOW!) for each tuple tr of r for each tuple ts of s print, if they match r(A, ...) s(A, ......) nr ns

  45. 2-way joins Algorithm #0: why is it bad? how many disk accesses (‘br’ and ‘bs’ are the number of blocks for ‘r’ and ‘s’)? r(A, ...) s(A, ......) nr ns nr*bs + br

  46. 2-way joins Algorithm #1: Blocked nested-loop join read in a block of r read in a block of s print matching tuples cost: br + br * bs r(A, ...) s(A, ......) nr, br ns records, bs blocks

  47. 2-way joins Arithmetic example: nr = 10,000 tuples, br = 1,000 blocks ns = 1,000 tuples, bs = 200 blocks alg#0: 2,001,000 d.a. alg#1: 201,000 d.a. r(A, ...) s(A, ......) 10,000 1,000 1,000 records, 200 blocks

  48. 2-way joins Observation1: Algo#1: asymmetric: cost: br + br * bs- reverse roles: cost= bs + bs*br Best choice? smallest relation in outer loop r(A, ...) s(A, ......) nr, br ns records, bs blocks

  49. 2-way joins Other algorithm(s) for r JOIN s? nr, ns tuples each r(A, ...) s(A, ......) nr ns

  50. 2-way joins - other algo’s sort-merge sort ‘r’; sort ‘s’; merge sorted versions (good, if one or both are already sorted) r(A, ...) s(A, ......) nr ns

More Related