1 / 16

ICS 421 Spring 2010 Query Evaluation (ii)

ICS 421 Spring 2010 Query Evaluation (ii). Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa. What do these queries have in common ?. SELECT S.sname FROM Sailors S WHERE S.rating >5 ORDER BY S.age. SELECT DISTINCT S.sname FROM Sailors S.

halen
Download Presentation

ICS 421 Spring 2010 Query Evaluation (ii)

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. ICS 421 Spring 2010Query Evaluation (ii) Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa Lipyeow Lim -- University of Hawaii at Manoa

  2. What do these queries have in common ? SELECTS.sname FROMSailors S WHERES.rating>5 ORDER BYS.age SELECTDISTINCTS.sname FROMSailors S SELECTS.age, AVG(S.rating) FROMSailors S GROUP BY S.age Lipyeow Lim -- University of Hawaii at Manoa

  3. The Sort Operator • Sorting is a classic problem in computer science! • Data requested in sorted order • e.g., find students in increasing gpa order • Sorting is first step in bulk loading B+ tree index. • Sorting useful for eliminating duplicate copies in a collection of records (Why?) • Sort-merge join algorithm involves sorting. • Problem: sort 100Gb of data with 1Gb of RAM. • why not virtual memory? Lipyeow Lim -- University of Hawaii at Manoa

  4. Two-Way External Merge Sort • Pass 0: • Read a page, sort it in memory, write it to disk • Only one buffer page is needed • Pass 1, 2, 3, 4 …: • Read two (sorted) pages, merge them to fill output page, flush output page when full. • 2 input pages and 1 output page are needed INPUT 1 OUTPUT INPUT 2 Main memory buffers Disk Disk Lipyeow Lim -- University of Hawaii at Manoa

  5. Two-Way Merge Sort: Example 6,2 2 Input file 3,4 9,4 8,7 5,6 3,1 PASS 0 1,3 2 1-page runs 3,4 2,6 4,9 7,8 5,6 PASS 1 4,7 1,3 2,3 2-page runs 8,9 5,6 2 4,6 PASS 2 2,3 4,4 1,2 4-page runs 6,7 3,5 6 8,9 PASS 3 1,2 2,3 3,4 8-page runs 4,5 6,6 7,8 9 Lipyeow Lim -- University of Hawaii at Manoa

  6. Two-Way Merge Sort: Analysis 6,2 2 Input file 3,4 9,4 8,7 5,6 3,1 • Input file has N pages • Each pass reads N pages and writes N pages. • The number of passes = log2 N + 1 • So total cost is = 2N(log2 N + 1) • Idea:Divide and conquer: sort subfiles and merge PASS 0 1,3 2 1-page runs 3,4 2,6 4,9 7,8 5,6 PASS 1 4,7 1,3 2,3 2-page runs 8,9 5,6 2 4,6 PASS 2 2,3 4,4 1,2 4-page runs 6,7 3,5 6 8,9 PASS 3 1,2 2,3 3,4 8-page runs 4,5 6,6 7,8 9 Lipyeow Lim -- University of Hawaii at Manoa

  7. K-Way External Merge Sort • What if we have more memory ? • Sort a file with N pages using B buffer pages: • Pass 0: • read in B pages, sort all B pages in memory, write to disk as 1 run, repeat until all N pages are sorted -- outputs N/B sorted runs • Pass 1,2,…: • Use B-1 buffer pages as input and perform (B-1)-way merge to fill 1 output buffer page. INPUT 1 . . . . . . INPUT 2 . . . OUTPUT INPUT B-1 Disk Disk B Main memory buffers Lipyeow Lim -- University of Hawaii at Manoa

  8. K-Way Merge Sort: Analysis • B=5 buffer pages, N=108 pages • Pass 0: 108/5 =22 sorted runs of 5 pages each • Pass 1: 22/4 = 6 sorted runs of 20 pages each • Pass 2: 6/4 = 2 sorted runs of 80 & 28 pages • Pass 3: 1 sorted file of 108 pages • Number of passes = logB-1N/B+ 1 • Each pass still reads N pages and writes N pages • Total number of I/O’s = 2N * (logB-1N/B+1) Lipyeow Lim -- University of Hawaii at Manoa

  9. Selection Operator • Index vs Table Scan • Multiple Indexes • Eg. Use index(age) & index(rating) for “age>20 AND rating>9” • Intersect RID sets using bloom filters • Eg. Use index(age) & index(rating) for “age>20 OR rating>9” • Union RID sets Lipyeow Lim -- University of Hawaii at Manoa

  10. Projection Operator • Two steps: • Remove unwanted columns • Eliminate duplicates • How do we do step 2 ? • External merge sort • Scan sorted data to remove duplicates • Optimization: combine the 2 steps into merge sort: • Remove unwanted columns in Pass 0. • Subsequent passes can remove duplicates whenever they are encountered. SELECTDISTINCTS.sname FROMSailors S Lipyeow Lim -- University of Hawaii at Manoa

  11. Hash Join R equijoin S on sid • Partition R into k partitions using hash function h1(R.sid) • Partition S into k partitions using hash function h1(S.sid) • Foreach partition i • Build inmemory hash table H(R[i]) for R[i] using h2(R.sid) • Foreach row in S[i] • Probe H(R[i]) • Output join tuples <r,s> • Works only on equi-joins • Total I/Os = 2*NPages(R) + 2*NPages(S) + NPages(R) + NPages(S) = 3 * [Npages(R) + Npages(S)] • Can be applied in a distributed DBMS with hash partitions on the join attribute! Lipyeow Lim -- University of Hawaii at Manoa

  12. Example SELECTS.sname FROM Reserves R, Sailors S WHERE R.sid=S.sid AND R.bid=100 ANDS.rating>5 πS.sname R.sid=S.sid On the fly • Nested Loop Join cost 1K+ 100K*500 • On the fly selection and project does not incur any disk access. • Total disk access = 500001K (worst case) S.rating>5 AND R.bid=100 On the fly Nested Loop Join Reserves Sailors (SCAN) (SCAN) Lipyeow Lim -- University of Hawaii at Manoa

  13. Example: Predicate Pushdown SELECTS.sname FROM Reserves R, Sailors S WHERE R.sid=S.sid AND R.bid=100 ANDS.rating>5 πS.sname 10% 50% R.sid=S.sid On the fly • Nested Loop Join requires materializing the inner table as T1. • With 50% selectivity, T1 has 250 pages • With 10% selectivity, outer “table” in join has 10K tuples • Disk accesses for scans = 1000 + 500 • Writing T1 = 250 • NLJoin = 10K * 250 • Total disk access = 2500.175 K (worst case) Nested Loop Join Temp T1 R.bid=100 S.rating>5 (SCAN) (SCAN) Reserves Sailors What happens if we make the left leg the inner table of the join ? Lipyeow Lim -- University of Hawaii at Manoa

  14. Example: Sort Merge Join SELECTS.sname FROM Reserves R, Sailors S WHERE R.sid=S.sid AND R.bid=100 ANDS.rating>5 πS.sname 10% 50% R.sid=S.sid On the fly • Sort Merge Join requires materializing both legs for sorting. • With 50% selectivity, T1 has 100 pages • With 10% selectivity, T2 has 250 pages • Disk accesses for scans = 1000 + 500 • Writing T1 & T2 = 100 + 250 • Sort Merge Join = 100 log 100 + 250 log 250 + 100+250 (assume 10 way merge sort) • Total disk access = 52.8 K Sort Merge Join Temp T1 Temp T2 R.bid=100 S.rating>5 (SCAN) (SCAN) Reserves Sailors What happens if we make the left leg the inner table of the join ? Lipyeow Lim -- University of Hawaii at Manoa

  15. Example: Index Nested Loop Join SELECTS.sname FROM Reserves R, Sailors S WHERE R.sid=S.sid AND R.bid=100 ANDS.rating>5 πS.sname 10% 50% R.sid=S.sid On the fly • With 10% selectivity, selection on R has 10K tuples • Disk accesses for scan = 1000 • Index Nested Loop Join = 10K*( 1 + log10 500) = 37K • Total disk access = 38 K S.rating>5 On the fly Index nested Loop Using Index on S.sid R.bid=100 (SCAN) Reserves Sailors What happens if we make the left leg the inner table of the join ? Lipyeow Lim -- University of Hawaii at Manoa

  16. Join Ordering 500 10K 30K 10K • Independent of what join algorithm is chosen, the order in which joins are perform affects the performance. • Rule of thumb: do the most “selective” join first • In practice, left deep trees (eg. the right one above) are preferred --- why ? A C 20K 30K 10K 20K C B B A Lipyeow Lim -- University of Hawaii at Manoa

More Related