1 / 116

CS411 Database Systems

CS411 Database Systems. 10: Query Processing. Query Execution. Outline. Logical/physical operators Cost parameters and sorting One-pass algorithms Nested-loop joins Two-pass algorithms. Query Execution. Query or update. User/ Application. Query compiler.

orly
Download Presentation

CS411 Database Systems

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CS411Database Systems 10: Query Processing

  2. Query Execution

  3. Outline • Logical/physical operators • Cost parameters and sorting • One-pass algorithms • Nested-loop joins • Two-pass algorithms

  4. Query Execution Query or update User/ Application Query compiler Query execution plan Execution engine Record, index requests Index/record mgr. Page commands Buffer manager Read/write pages Storage manager storage

  5. Logical v.s. Physical Operators • Logical operators • what they do • e.g., union, selection, project, join, grouping • Physical operators • how they do it • e.g., nested loop join, sort-merge join, hash join, index join

  6. Query Execution Plans SELECT S.sname FROM Purchase P, Person Q WHERE P.buyer=Q.name AND Q.city=‘urbana’ AND Q.phone > ‘5430000’ buyer  City=‘urbana’ phone>’5430000’ Query Plan: • logical tree • implementation choice at every node • scheduling of operations. (Simple Nested Loops) Buyer=name Person Purchase (Table scan) (Index scan) Some operators are from relational algebra, and others (e.g., scan, group) are not.

  7. How do We Combine Operations? • The iterator model. Each operation is implemented by 3 functions: • Open: sets up the data structures and performs initializations • GetNext: returns the the next tuple of the result. • Close: ends the operations. Cleans up the data structures. • Enables pipelining! • Contrast with data-driven materialize model.

  8. Cost Parameters • Cost parameters • M = number of blocks that fit in main memory • B(R) = number of blocks holding R • T(R) = number of tuples in R • V(R,a) = number of distinct values of the attribute a • Estimating the cost: • Important in optimization (next lecture) • Compute I/O cost only • We compute the cost to read the tables • We don’t compute the cost to write the result (because pipelining)

  9. Reminder: Sorting • Two pass multi-way merge sort • Step 1: • Read M blocks at a time, sort, write • Result: have runs of length M on disk • Step 2: • Merge M-1 at a time, write to disk • Result: have runs of length M(M-1)M2 • Cost: 3B(R), Assumption: B(R)  M2

  10. Scanning Tables • The table is clustered (I.e. blocks consists only of records from this table): • Table-scan: if we know where the blocks are • Index scan: if we have a sparse index to find the blocks • The table is unclustered (e.g. its records are placed on blocks with other tables) • May need one read for each record

  11. Cost of the Scan Operator • Clustered relation: • Table scan: B(R); to sort: 3B(R) • Index scan: B(R); to sort: B(R) or 3B(R) • Unclustered relation • T(R); to sort: T(R) + 2B(R)

  12. One pass algorithm

  13. One-pass Algorithms Selection s(R), projection P(R) • Both are tuple-at-a-Time algorithms • Cost: B(R) Unary operator Input buffer Output buffer

  14. One-pass Algorithms Duplicate elimination d(R) • Need to keep a dictionary in memory: • balanced search tree • hash table • etc • Cost: B(R) • Assumption: B(d(R)) <= M

  15. One-pass Algorithms Grouping: gcity, sum(price) (R) • Need to keep a dictionary in memory • Also store the sum(price) for each city • Cost: B(R) • Assumption: number of cities fits in memory

  16. One-pass Algorithms Binary operations: R ∩ S, R U S, R – S • Assumption: min(B(R), B(S)) <= M • Scan one table first, then the next, eliminate duplicates • Cost: B(R)+B(S)

  17. Nested loop join

  18. Nested Loop Joins • Tuple-based nested loop R S for each tuple r in R do for each tuple s in S do if r and s join then output (r,s) • Cost: T(R) T(S), sometimes T(R) B(S)

  19. Nested Loop Joins • Block-based Nested Loop Join for each (M-1) blocks bs of S do for each block br of R do for each tuple s in bs do for each tuple r in br do if r and s join then output(r,s)

  20. . . . Nested Loop Joins Join Result R & S Hash table for block of S (k < B-1 pages) . . . . . . Output buffer Input buffer for R

  21. Nested Loop Joins • Block-based Nested Loop Join • Cost: • Read S once: cost B(S) • Outer loop runs B(S)/(M-1) times, and each time need to read R: costs B(S)B(R)/(M-1) • Total cost: B(S) + B(S)B(R)/(M-1) • Notice: it is better to iterate over the smaller relation first • R S: R=outer relation, S=inner relation

  22. Two pass algorithm

  23. Two-Pass Algorithms Based on Sorting Duplicate elimination d(R) • Simple idea: sort first, then eliminate duplicates • Step 1: sort runs of size M, write • Cost: 2B(R) • Step 2: merge M-1 runs, but include each tuple only once • Cost: B(R) • Some complications... • Total cost: 3B(R), Assumption: B(R)<= M2

  24. Two-Pass Algorithms Based on Sorting Grouping: gcity, sum(price) (R) • Same as before: sort, then compute the sum(price) for each group • As before: compute sum(price) during the merge phase. • Total cost: 3B(R) • Assumption: B(R)<= M2

  25. Two-Pass Algorithms Based on Sorting Binary operations: R ∩ S, R U S, R – S • Idea: sort R, sort S, then do the right thing • A closer look: • Step 1: split R into runs of size M, then split S into runs of size M. Cost: 2B(R) + 2B(S) • Step 2: merge M/2 runs from R; merge M/2 runs from S; ouput a tuple on a case by cases basis • Total cost: 3B(R)+3B(S) • Assumption: B(R)+B(S)<= M2

  26. Two-Pass Algorithms Based on Sorting Join R S • Start by sorting both R and S on the join attribute: • Cost: 4B(R)+4B(S) (because need to write to disk) • Read both relations in sorted order, match tuples • Cost: B(R)+B(S) • Difficulty: many tuples in R may match many in S • If at least one set of tuples fits in M, we are OK • Otherwise need nested loop, higher cost • Total cost: 5B(R)+5B(S) • Assumption: B(R)<= M2, B(S)<= M2

  27. Two-Pass Algorithms Based on Sorting Join R S • If the number of tuples in R matching those in S is small (or vice versa) we can compute the join during the merge phase • Total cost: 3B(R)+3B(S) • Assumption: B(R)+ B(S)<= M2

  28. Relation R Partitions OUTPUT 1 1 2 INPUT 2 hash function h . . . M-1 M-1 M main memory buffers Disk Disk Two Pass Algorithms Based on Hashing • Idea: partition a relation R into buckets, on disk • Each bucket has size approx. B(R)/M • Does each bucket fit in main memory ? • Yes if B(R)/M <= M, i.e. B(R) <= M2 1 2 B(R)

  29. Hash Based Algorithms for d • Recall: d(R) = duplicate elimination • Step 1. Partition R into buckets • Step 2. Apply d to each bucket (may read in main memory) • Cost: 3B(R) • Assumption:B(R) <= M2

  30. Hash Based Algorithms for g • Recall: g(R) = grouping and aggregation • Step 1. Partition R into buckets • Step 2. Apply g to each bucket (may read in main memory) • Cost: 3B(R) • Assumption:B(R) <= M2

  31. Hash-based Join • R S • Recall the main memory hash-based join: • Scan S, build buckets in main memory • Then scan R and join

  32. Partitioned Hash Join R S • Step 1: • Hash S into M buckets • send all buckets to disk • Step 2 • Hash R into M buckets • Send all buckets to disk • Step 3 • Join every pair of buckets

  33. Original Relation Partitions OUTPUT 1 1 2 INPUT 2 hash function h . . . M-1 M-1 B main memory buffers Disk Disk Partitions of R & S Join Result Hash table for partition Si ( < M-1 pages) hash fn h2 h2 Output buffer Input buffer for Ri B main memory buffers Disk Disk PartitionedHash-Join • Partition both relations using hash fn h: R tuples in partition i will only match S tuples in partition i. • Read in a partition of R, hash it using h2 (<> h!). Scan matching partition of S, search for matches.

  34. Partitioned Hash Join • Cost: 3B(R) + 3B(S) • Assumption: min(B(R), B(S)) <= M2

  35. Hybrid Hash Join Algorithm • When we have more memory: B(S) << M2 • Partition S into k buckets • But keep first bucket S1 in memory, k-1 buckets to disk • Partition R into k buckets • First bucket R1 is joined immediately with S1 • Other k-1 buckets go to disk • Finally, join k-1 pairs of buckets: • (R2,S2), (R3,S3), …, (Rk,Sk)

  36. Hybrid Join Algorithm • How big should we choose k ? • Average bucket size for S is B(S)/k • Need to fit B(S)/k + (k-1) blocks in memory • B(S)/k + (k-1) <= M • k slightly smaller than B(S)/M

  37. Hybrid Join Algorithm • How many I/Os ? • Recall: cost of partitioned hash join: • 3B(R) + 3B(S) • Now we save 2 disk operations for one bucket • Recall there are k buckets • Hence we save 2/k(B(R) + B(S)) • Cost: (3-2/k)(B(R) + B(S)) = (3-2M/B(S))(B(R) + B(S))

  38. a a a a a a a a a a Indexed Based Algorithms • In a clustered index all tuples with the same value of the key are clustered on as few blocks as possible

  39. Index Based Selection • Selection on equality: sa=v(R) • Clustered index on a: cost B(R)/V(R,a) • Unclustered index on a: cost T(R)/V(R,a)

  40. Index Based Selection • Example: B(R) = 2000, T(R) = 100,000, V(R, a) = 20, compute the cost of sa=v(R) • Cost of table scan: • If R is clustered: B(R) = 2000 I/Os • If R is unclustered: T(R) = 100,000 I/Os • Cost of index based selection: • If index is clustered: B(R)/V(R,a) = 100 • If index is unclustered: T(R)/V(R,a) = 5000 • Notice: when V(R,a) is small, then unclustered index is useless

  41. Index Based Join • R S • Assume S has an index on the join attribute • Iterate over R, for each tuple fetch corresponding tuple(s) from S • Assume R is clustered. Cost: • If index is clustered: B(R) + T(R)B(S)/V(S,a) • If index is unclustered: B(R) + T(R)T(S)/V(S,a)

  42. Index Based Join • Assume both R and S have a sorted index (B+ tree) on the join attribute • Then perform a merge join (called zig-zag join) • Cost: B(R) + B(S)

  43. Query Optimization

  44. Optimization • Chapter 16 • At the heart of the database engine • Step 1: convert the SQL query to some logical plan • Step 2: find a better logical plan, find an associated physical plan

  45. SQL –> Logical Query Plans

  46. Converting from SQL to Logical Plans Select a1, …, an From R1, …, Rk Where C Pa1,…,an(sC(R1 R2 … Rk)) Select a1, …, an From R1, …, Rk Where C Group by b1, …, bl Pa1,…,an(gb1, …, bm, aggs (sC(R1 R2 … Rk)))

  47. Converting Nested Queries Selectdistinct product.name From product Where product.maker in (Select company.name From company where company.city=“Urbana”) Selectdistinct product.name From product, company Where product.maker = company.name AND company.city=“Urbana”

  48. Converting Nested Queries Selectdistinct x.name, x.maker From product x Where x.color= “blue” AND x.price >= ALL (Select y.price From product y Where x.maker = y.maker AND y.color=“blue”) How do we convert this one to logical plan ?

More Related