Principles of Query Processing

CS5226 Week 5. Principles of Query Processing. Pang Hwee Hwa School of Computing, NUS. Application Programmer (e.g., business analyst, Data architect). Application. Sophisticated Application Programmer (e.g., SAP admin). Query Processor. Indexes. Storage Subsystem. Concurrency Control.

Presentation Transcript


  1. CS5226 Week 5 Principles of Query Processing Pang Hwee Hwa School of Computing, NUS H. Pang / NUS

  2. (Figure: system layers and the roles that work with them.)
     Layers: Application; Query Processor; Indexes; Storage Subsystem; Concurrency Control; Recovery; Operating System; Hardware [processor(s), disk(s), memory].
     Roles: application programmer (e.g., business analyst, data architect); sophisticated application programmer (e.g., SAP admin); DBA, tuner.

  3. Overview of Query Processing
     (Figure: High-Level Query → Parser → Parsed Query → Query Optimizer → QEP → Query Evaluator → Query Result. The Query Optimizer draws on Database Statistics and the Cost Model.)

  4. Outline
     • Processing relational operators
     • Query optimization
     • Performance tuning

  5. Projection Operator
     • π_{R.attrib, ...}(R)
     • Implementation is straightforward
         SELECT bid FROM Reserves R WHERE R.rname < 'C%'

  6. Selection Operator
     • σ_{R.attr op value}(R)
     • Size of result = size of R * selectivity
     • Scan
     • Clustered index: good
     • Non-clustered index:
       - Good for low selectivity
       - Worse than scan for high selectivity
         SELECT * FROM Reserves R WHERE R.rname < 'C%'

  7. Example of Join
         SELECT * FROM Sailors R, Reserve S WHERE R.sid = S.sid

  8. Notations
     • |R| = number of pages in outer table R
     • ||R|| = number of tuples in outer table R
     • |S| = number of pages in inner table S
     • ||S|| = number of tuples in inner table S
     • M = number of main memory pages allocated

  9. Simple Nested Loop Join
     (Figure: for each of the ||R|| tuples of R, one scan of S, at |S| pages per scan.)

  10. Simple Nested Loop Join
      • Scan inner table S per R tuple: ||R|| * |S|
        - Each scan costs |S| pages
        - For ||R|| tuples
      • |R| pages for outer table R
      • Total cost = |R| + ||R|| * |S| pages
      • Not optimal!

  11. Block Nested Loop Join
      (Figure: R is read in blocks of M - 2 pages, giving |R| / (M - 2) blocks; for each block, one scan of S, at |S| pages per scan.)

  12. Block Nested Loop Join
      • Scan inner table S per block of (M - 2) pages of R tuples
        - Each scan costs |S| pages
        - |R| / (M - 2) blocks of R tuples
      • |R| pages for outer table R
      • Total cost = |R| + |R| / (M - 2) * |S| pages
      • R should be the smaller table
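
The two nested-loop cost formulas can be compared with a few lines of Python. This is only a sketch: the function names and the table sizes are made up for illustration, and the number of blocks is rounded up to a whole number, which the slide formula leaves implicit.

```python
import math

def simple_nl_cost(pages_r, tuples_r, pages_s):
    """Simple nested loop: |R| + ||R|| * |S| page I/Os."""
    return pages_r + tuples_r * pages_s

def block_nl_cost(pages_r, pages_s, m):
    """Block nested loop: |R| + ceil(|R| / (M - 2)) * |S| page I/Os."""
    blocks = math.ceil(pages_r / (m - 2))
    return pages_r + blocks * pages_s

# Hypothetical sizes: R has 500 pages / 40,000 tuples, S has 1,000 pages, M = 52.
print(simple_nl_cost(500, 40_000, 1_000))  # 40000500
print(block_nl_cost(500, 1_000, 52))       # 10500  (smaller table as outer)
print(block_nl_cost(1_000, 500, 52))       # 11000  (larger table as outer)
```

With these numbers the block variant is several thousand times cheaper, and putting the smaller table on the outside saves a further 500 page I/Os.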

  13. Index Nested Loop Join
      (Figure: for each of the ||R|| tuples of R, one probe of the index on S.)

  14. Index Nested Loop Join
      • Probe S index for matching S tuples per R tuple
        - Probe hash index: 1.2 I/Os
        - Probe B+ tree: 2-4 I/Os, plus retrieve matching S tuples: 1 I/O
        - For ||R|| tuples
      • |R| pages for outer table R
      • Total cost = |R| + ||R|| * index retrieval
      • Better than Block NL join only for small number of R tuples
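
To see why index nested loop only wins for a small outer table, the sketch below compares it with block nested loop. The function names and table sizes are hypothetical, and "index retrieval" is assumed to mean one probe plus one I/O to fetch the matching S tuple.

```python
import math

def index_nl_cost(pages_r, tuples_r, probe_ios, fetch_ios=1.0):
    """Index nested loop: |R| + ||R|| * (probe + fetch) page I/Os."""
    return pages_r + tuples_r * (probe_ios + fetch_ios)

def block_nl_cost(pages_r, pages_s, m):
    """Block nested loop: |R| + ceil(|R| / (M - 2)) * |S| page I/Os."""
    return pages_r + math.ceil(pages_r / (m - 2)) * pages_s

PAGES_S, M, HASH_PROBE = 1_000, 52, 1.2  # assumed values for illustration

# (pages, tuples) for a tiny outer table and a somewhat larger one.
for pages_r, tuples_r in [(1, 100), (10, 1_000)]:
    inl = index_nl_cost(pages_r, tuples_r, HASH_PROBE)
    bnl = block_nl_cost(pages_r, PAGES_S, M)
    winner = "index NL" if inl < bnl else "block NL"
    print(f"||R|| = {tuples_r}: index NL ~{inl:.0f}, block NL {bnl}, cheaper: {winner}")
```

Under these assumptions the index join is cheaper for 100 outer tuples but already loses to a single scan of S at 1,000 outer tuples.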

  15. Sort Merge Join
      • External sort R
      • External sort S
      • Merge sorted R and sorted S

  16. External Sort of R
      (Figure: a split pass turns R into sorted runs R0,1, R0,2, ...; merge pass 1 combines them into runs R1,1, R1,2, ...; merge pass 2 produces R2,1; each merge is an (M-1)-way merge.)
      • Size of each R0,i = M pages; number of R0,i's = |R|/M
      • Number of merge passes = log_{M-1}(|R|/M)
      • Cost per pass = |R| input + |R| output = 2|R|
      • Total cost = 2|R| * (log_{M-1}(|R|/M) + 1), including the split pass

  17. Sort Merge Join
      • External-sort R: 2|R| * (log_{M-1}(|R|/M) + 1)
        - Split R into |R|/M sorted runs, each of size M: 2|R|
        - Merge up to (M - 1) runs repeatedly
        - log_{M-1}(|R|/M) passes, each costing 2|R|
      • External-sort S: 2|S| * (log_{M-1}(|S|/M) + 1)
      • Merge matching tuples from sorted R and S: |R| + |S|
      • Total cost = 2|R| * (log_{M-1}(|R|/M) + 1) + 2|S| * (log_{M-1}(|S|/M) + 1) + |R| + |S|
      • If |R| < M*(M-1) (and likewise for S), each sort needs only one merge pass, so cost = 5 * (|R| + |S|)
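
The sort-merge cost formula can be evaluated as follows. This is a sketch with made-up function names; it counts merge passes with an integer loop (rounding the number of runs up) rather than a floating-point logarithm, which is an implementation choice and not part of the slides.

```python
import math

def merge_passes(pages, m):
    """Number of (M-1)-way merge passes needed after the split pass."""
    runs = math.ceil(pages / m)          # sorted runs of M pages each
    passes = 0
    while runs > 1:
        runs = math.ceil(runs / (m - 1))  # each pass merges up to M-1 runs
        passes += 1
    return passes

def external_sort_cost(pages, m):
    """2 * pages per pass, for the split pass plus every merge pass."""
    return 2 * pages * (merge_passes(pages, m) + 1)

def sort_merge_join_cost(pages_r, pages_s, m):
    """Sort both inputs, then read each once to merge matching tuples."""
    return (external_sort_cost(pages_r, m)
            + external_sort_cost(pages_s, m)
            + pages_r + pages_s)

# Hypothetical sizes: |R| = 1000, |S| = 500, M = 52, so |R| < M * (M - 1).
print(sort_merge_join_cost(1_000, 500, 52))  # 7500, i.e. 5 * (1000 + 500)
```

The example satisfies the one-merge-pass condition, so the result matches the 5 * (|R| + |S|) shortcut.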

  18. GRACE Hash Join
      (Figure: R and S are each split into buckets 0-3 using bucketID = X mod 4, for a join on R.X = S.X.)
      R ⋈ S = (R0 ⋈ S0) + (R1 ⋈ S1) + (R2 ⋈ S2) + (R3 ⋈ S3)

  19. GRACE Hash Join - Partition Phase
      (Figure: the original relation is read from disk through one input buffer; hash function h1 routes each tuple to one of M - 1 output buffers, which are written back to disk as partitions; M main memory buffers in total.)
      • R → (M - 1) partitions, each of size |R| / (M - 1)

  20. GRACE Hash Join - Join Phase
      (Figure: partitions of R and S are read from disk; a hash table for partition Ri (< M - 1 pages) is built with hash function h2; Si passes through an input buffer and is probed with h2; matches go to an output buffer and on to the join result; B main memory buffers.)
      • Partition must fit in memory: |R| / (M - 1) < M - 1

  21. GRACE Hash Join Algorithm
      • Partition phase: 2 * (|R| + |S|)
        - Partition table R using hash function h1: 2|R|
        - Partition table S using hash function h1: 2|S|
        - R tuples in partition i will match only S tuples in partition i
        - R → (M - 1) partitions, each of size |R| / (M - 1)
      • Join phase: |R| + |S|
        - Read in a partition of R (|R| / (M - 1) < M - 1)
        - Hash it using function h2 (<> h1!)
        - Scan corresponding S partition, search for matches
      • Total cost = 3 * (|R| + |S|) pages
      • Condition: M > sqrt(f * |R|), f ≈ 1.2 to account for hash table
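
The two phases can be sketched in memory as below. This is an illustration only, not a disk-based implementation: Python lists stand in for partitions on disk, `hash(key) % num_partitions` plays the role of h1, and the per-partition dictionary plays the role of the h2 hash table. The function name, parameters, and sample tuples are all hypothetical.

```python
from collections import defaultdict

def grace_hash_join(r, s, key_r, key_s, num_partitions=4):
    """Join tuple lists r and s on r[key_r] == s[key_s]."""
    # Partition phase: the same function (h1) is applied to both inputs,
    # so matching tuples always land in partitions with the same number.
    r_parts, s_parts = defaultdict(list), defaultdict(list)
    for t in r:
        r_parts[hash(t[key_r]) % num_partitions].append(t)
    for u in s:
        s_parts[hash(u[key_s]) % num_partitions].append(u)

    # Join phase: build a hash table on each R partition (h2),
    # then scan the corresponding S partition and probe it.
    result = []
    for i in range(num_partitions):
        table = defaultdict(list)
        for t in r_parts[i]:
            table[t[key_r]].append(t)
        for u in s_parts[i]:
            for t in table.get(u[key_s], []):
                result.append((t, u))
    return result

sailors = [(1, "Ann"), (2, "Bob"), (5, "Cy")]            # (sid, sname)
reserves = [(1, 100), (5, 101), (5, 102), (7, 103)]      # (sid, bid)
for pair in sorted(grace_hash_join(sailors, reserves, 0, 0)):
    print(pair)
```

Sailor 2 has no reservation and reservation sid 7 has no sailor, so only three pairs come out.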

  22. Summary of Join Operator
      • Simple nested loop: |R| + ||R|| * |S|
      • Block nested loop: |R| + |R| / (M - 2) * |S|
      • Index nested loop: |R| + ||R|| * index retrieval
      • Sort-merge: 2|R| * (log_{M-1}(|R|/M) + 1) + 2|S| * (log_{M-1}(|S|/M) + 1) + |R| + |S|
      • GRACE hash: 3 * (|R| + |S|)
        - Condition: M > sqrt(f * |R|)

  23. Overview of Query Processing
      (Figure: High-Level Query → Parser → Parsed Query → Query Optimizer → QEP → Query Evaluator → Query Result. The Query Optimizer draws on Database Statistics and the Cost Model.)

  24. Query Optimization
      • Given: an SQL query joining n tables
      • Dream: map to most efficient plan
      • Reality: avoid rotten plans
      • State of the art:
        - Most optimizers follow System R's technique
        - Works fine up to about 10 joins
          SELECT S.sname FROM Reserves R, Sailors S
          WHERE R.sid = S.sid AND R.bid = 100 AND S.rating > 5
      (Figure: query tree with operators sname, rating > 5, bid=100 and sid=sid over Sailors and Reserves.)

  25. Complexity of Query Optimization
      • Many degrees of freedom
        - Selection: scan versus (clustered, non-clustered) index
        - Join: block nested loop, sort-merge, hash
        - Relative order of the operators
        - Exponential search space!
      • Heuristics
        - Push the selections down
        - Push the projections down
        - Delay Cartesian products
        - System R: only left-deep trees
      (Figure: left-deep join tree over tables A, B, C, D.)

  26. Equivalences in Relational Algebra
      • Selection: cascade, commutative
      • Projection: cascade
      • Join:
        - Associative: R ⋈ (S ⋈ T) ≡ (R ⋈ S) ⋈ T
        - Commutative: (R ⋈ S) ≡ (S ⋈ R)

  27. Equivalences in Relational Algebra
      • A projection commutes with a selection that only uses attributes retained by the projection
      • Selection between attributes of the two arguments of a cross-product converts cross-product to a join
      • A selection on just attributes of R commutes with join R ⋈ S (i.e., σ(R ⋈ S) ≡ σ(R) ⋈ S)
      • Similarly, if a projection follows a join R ⋈ S, we can "push" it by retaining only attributes of R (and S) that are needed for the join or are kept by the projection

  28. System R Optimizer
      • Find all plans for accessing each base table
      • For each table
        - Save cheapest unordered plan
        - Save cheapest plan for each interesting order
        - Discard all others
      • Try all ways of joining pairs of 1-table plans; save cheapest unordered + interesting ordered plans
      • Try all ways of joining 2-table with 1-table
      • Combine k-table with 1-table till you have full plan tree
      • At the top, to satisfy GROUP BY and ORDER BY
        - Use interesting ordered plan
        - Add a sort node to unordered plan
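
The bottom-up enumeration (1-table plans, then 2-table, then k-table joined with one more table) can be sketched as a small dynamic program over left-deep join orders. This is a heavy simplification under stated assumptions: it ignores access paths, join methods, and interesting orders, and it uses the sum of intermediate result sizes as a stand-in cost. The function name, cardinalities, and selectivities are hypothetical.

```python
from itertools import combinations

def best_left_deep_plan(card, sel):
    """card: {table: tuples}; sel: {frozenset({a, b}): join selectivity}.
    Returns (cost, result_size, join_order) for the cheapest left-deep plan."""
    tables = sorted(card)
    # Cheapest plan found so far for each subset of tables.
    best = {frozenset([t]): (0.0, float(card[t]), (t,)) for t in tables}
    for k in range(2, len(tables) + 1):
        for subset in map(frozenset, combinations(tables, k)):
            for t in sorted(subset):             # t is the table joined last
                rest = subset - {t}
                cost, size, order = best[rest]   # reuse the saved (k-1)-table plan
                factor = 1.0
                for u in rest:                   # no predicate => Cartesian product
                    factor *= sel.get(frozenset({t, u}), 1.0)
                new_size = size * card[t] * factor
                candidate = (cost + new_size, new_size, order + (t,))
                if subset not in best or candidate[0] < best[subset][0]:
                    best[subset] = candidate     # keep the cheapest, discard the rest
    return best[frozenset(tables)]

card = {"A": 1000, "B": 100, "C": 10}
sel = {frozenset({"A", "B"}): 0.01, frozenset({"B", "C"}): 0.1}
cost, size, order = best_left_deep_plan(card, sel)
print(order, cost)
```

For this input the cheapest order joins B and C first and adds A last; starting with the A-C Cartesian product is the most expensive option, which is the reason behind the "delay Cartesian products" heuristic.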

  29. (Figure.) Source: Selinger et al., "Access Path Selection in a Relational Database Management System"

  30. (Figure.) Note: Only branches for NL join are shown here. Additional branches for other join methods (e.g., sort-merge) are not shown. Source: Selinger et al., "Access Path Selection in a Relational Database Management System"

  31. What is "Cheapest"?
      • Need information about the relations and indexes involved
      • Catalogs typically contain at least:
        - # tuples (NTuples) and # pages (NPages) for each relation
        - # distinct key values (NKeys) and NPages for each index
        - Index height, low/high key values (Low/High) for each tree index
      • Catalogs updated periodically
        - Updating whenever data changes is too expensive; lots of approximation anyway, so slight inconsistency is OK
      • More detailed information (e.g., histograms of the values in some field) is sometimes stored

  32. Estimating Result Size
      • Consider a query block:
          SELECT attribute list FROM relation list WHERE term1 AND ... AND termk
      • Maximum # tuples in result is the product of the cardinalities of relations in the FROM clause
      • Reduction factor (RF) associated with each term_i reflects the impact of the term in reducing result size
        - Term col=value has RF 1/NKeys(I)
        - Term col1=col2 has RF 1/MAX(NKeys(I1), NKeys(I2))
        - Term col>value has RF (High(I)-value)/(High(I)-Low(I))
      • Result cardinality = max # tuples * product of all RFs
      • Implicit assumption that terms are independent!
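
The reduction-factor rules translate directly into code. The sketch below applies them to the running Reserves/Sailors query; the function names are made up, and the statistics (100,000 Reserves tuples, 40,000 Sailors tuples, 100 distinct bid values, ratings from 1 to 10, 40,000 distinct sid values) are assumptions chosen for illustration.

```python
import math

def rf_col_eq_value(nkeys):
    """col = value: RF = 1 / NKeys(I)."""
    return 1 / nkeys

def rf_col_eq_col(nkeys1, nkeys2):
    """col1 = col2: RF = 1 / MAX(NKeys(I1), NKeys(I2))."""
    return 1 / max(nkeys1, nkeys2)

def rf_col_gt_value(value, low, high):
    """col > value: RF = (High(I) - value) / (High(I) - Low(I))."""
    return (high - value) / (high - low)

def result_cardinality(cardinalities, reduction_factors):
    """Max # tuples times the product of all RFs (terms assumed independent)."""
    return math.prod(cardinalities) * math.prod(reduction_factors)

estimate = result_cardinality(
    [100_000, 40_000],                  # Reserves, Sailors (assumed sizes)
    [rf_col_eq_value(100),              # R.bid = 100
     rf_col_gt_value(5, 1, 10),         # S.rating > 5
     rf_col_eq_col(40_000, 40_000)],    # R.sid = S.sid
)
print(round(estimate))
```

Because the estimate multiplies the factors, any correlation between the predicates makes it drift from the true result size, which is the independence caveat on the slide.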

  33. Cost Estimates for Single-Table Plans
      • Index I on primary key matches selection:
        - Cost is Height(I)+1 for a B+ tree, about 1.2 for hash index
      • Clustered index I matching one or more selects:
        - (NPages(I)+NPages(R)) * product of RFs of matching selects
      • Non-clustered index I matching one or more selects:
        - (NPages(I)+NTuples(R)) * product of RFs of matching selects
      • Sequential scan of file:
        - NPages(R)
      • Note: Typically, no duplicate elimination on projections! (Exception: done on answers if user says DISTINCT.)
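
A small sketch of how these estimates pick an access path. The function name and the statistics are hypothetical: a relation of 1,000 pages and 100,000 tuples, an index of 50 pages, and a selection with reduction factor 1/100.

```python
def single_table_plan_costs(npages_r, ntuples_r, npages_i, rf):
    """Estimated page I/Os for each access path, following the slide formulas."""
    return {
        "clustered index": (npages_i + npages_r) * rf,
        "non-clustered index": (npages_i + ntuples_r) * rf,
        "sequential scan": float(npages_r),
    }

costs = single_table_plan_costs(npages_r=1_000, ntuples_r=100_000,
                                npages_i=50, rf=1 / 100)
cheapest = min(costs, key=costs.get)
for path, cost in costs.items():
    print(f"{path}: ~{cost:.1f} I/Os")
print("cheapest:", cheapest)
```

With these numbers the clustered index costs about 10 I/Os, while the non-clustered index is no better than a sequential scan, since each matching tuple may sit on a different page.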

  34. Counting the Costs
      (Figure: plan tree. Top: projection on sname, on the fly. Below: sort-merge join on sid=sid. Left input: selection bid=100 on Reserves (scan; write to temp T1). Right input: selection rating > 5 on Sailors (scan; write to temp T2).)
          SELECT S.sname FROM Reserves R, Sailors S
          WHERE R.sid = S.sid AND R.bid = 100 AND S.rating > 5
      • With 5 buffers, cost of plan:
        - Scan Reserves (1000) + write temp T1 (10 pages, if we have 100 boats, uniform distribution)
        - Scan Sailors (500) + write temp T2 (250 pages, if we have 10 ratings)
        - Sort T1 (2*10*2), sort T2 (2*250*4), merge (10+250), total = 2300
        - Total: 4060 page I/Os
      • If we used BNL join, join cost = 10 + 4*250, total cost = 2770
      • If we "push" projections, T1 has only sid, T2 only sid and sname:
        - T1 fits in 3 pages, cost of BNL drops to under 250 pages, total < 2000
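
The arithmetic behind the 4060 and 2770 totals can be reproduced step by step. The variable names are illustrative; the page counts are the ones given on the slide, and the number of merge passes (1 for T1, 3 for T2 with 5 buffers) is derived from the 2*10*2 and 2*250*4 terms.

```python
import math

BUFFERS = 5
scan_reserves, t1_pages = 1_000, 10    # selection bid=100, written to temp T1
scan_sailors, t2_pages = 500, 250      # selection rating>5, written to temp T2

# Both plans pay for the two scans and for writing the temp tables.
selections = scan_reserves + t1_pages + scan_sailors + t2_pages

# Sort-merge join: 2 * pages * (merge passes + 1) per sort, then one merge read.
sort_t1 = 2 * t1_pages * (1 + 1)
sort_t2 = 2 * t2_pages * (3 + 1)
merge = t1_pages + t2_pages
smj_join = sort_t1 + sort_t2 + merge
print("sort-merge join:", smj_join, "total:", selections + smj_join)  # 2300, 4060

# Block nested loop with T1 as outer: |T1| + ceil(|T1| / (M - 2)) * |T2|.
bnl_join = t1_pages + math.ceil(t1_pages / (BUFFERS - 2)) * t2_pages
print("BNL join:", bnl_join, "total:", selections + bnl_join)          # 1010, 2770
```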

  35. Exercise
      (Figure: plan tree. Top: projection on sname, on the fly. Below: selection rating > 5, on the fly. Below: index nested loops join on sid=sid, with pipelining, using the hash index on sid of Sailors. Outer input: selection bid=100 on Reserves, using the clustered index on bid.)
      • Reserves: 100,000 tuples, 100 tuples per page
      • With clustered index on bid of Reserves, we get 100,000/100 = 1000 tuples on 1000/100 = 10 pages
      • Join column sid is a key for Sailors: at most one matching tuple
      • Decision not to push rating>5 before the join is based on availability of sid index on Sailors
      • Cost: selection of Reserves tuples (10 I/Os); for each tuple, must get matching Sailors tuple (1000*1.2); total 1210 I/Os

  36. Query Tuning

  37. Avoid Redundant DISTINCT
      • DISTINCT usually entails a sort operation
      • Slows down query optimization because there is one more "interesting" order to consider
      • Remove if you know the result has no duplicates
          SELECT DISTINCT ssnum FROM Employee
          WHERE dept = 'information systems'

  38. Change Nested Queries to Join
      • Nested query (might not use index on Employee.dept):
          SELECT ssnum FROM Employee
          WHERE dept IN (SELECT dept FROM Techdept)
      • Join (need DISTINCT if an employee might belong to multiple departments):
          SELECT ssnum FROM Employee, Techdept
          WHERE Employee.dept = Techdept.dept

  39. Avoid Unnecessary Temp Tables
      • Creating temp table causes update to catalog
      • Cannot use any index on original table
      • With temp table:
          SELECT * INTO Temp FROM Employee
          WHERE salary > 40000

          SELECT ssnum FROM Temp
          WHERE Temp.dept = 'information systems'
      • Without temp table:
          SELECT ssnum FROM Employee
          WHERE Employee.dept = 'information systems' AND salary > 40000

  40. Avoid Complicated Correlation Subqueries
      • Search all of e2 for each e1 record!
          SELECT ssnum FROM Employee e1
          WHERE salary = (SELECT MAX(salary) FROM Employee e2
                          WHERE e2.dept = e1.dept)
      • Rewritten with a temp table:
          SELECT MAX(salary) AS bigsalary, dept INTO Temp
          FROM Employee GROUP BY dept

          SELECT ssnum FROM Employee, Temp
          WHERE salary = bigsalary AND Employee.dept = Temp.dept

  41. Avoid Complicated Correlation Subqueries
      • SQL Server 2000 does a good job at handling the correlated subqueries (a hash join is used as opposed to a nested loop between query blocks)
      • The techniques implemented in SQL Server 2000 are described in "Orthogonal Optimization of Subqueries and Aggregates" by C. Galindo-Legaria and M. Joshi, SIGMOD 2001.

  42. Join on Clustering and Integer Attributes
      • Employee is clustered on ssnum
      • ssnum is an integer
      • Join on name:
          SELECT Employee.ssnum FROM Employee, Student
          WHERE Employee.name = Student.name
      • Join on ssnum:
          SELECT Employee.ssnum FROM Employee, Student
          WHERE Employee.ssnum = Student.ssnum

  43. Avoid HAVING when WHERE is enough
      • May first perform grouping for all departments!
          SELECT AVG(salary) AS avgsalary, dept FROM Employee
          GROUP BY dept HAVING dept = 'information systems'
      • With WHERE:
          SELECT AVG(salary) AS avgsalary FROM Employee
          WHERE dept = 'information systems' GROUP BY dept
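
The two formulations return the same average, which can be checked on a toy table. This sketch uses Python's built-in `sqlite3` module as a stand-in DBMS; the rows are made up, and it demonstrates equivalence only, not the performance difference, which depends on the optimizer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (ssnum INTEGER, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?)",
    [(1, "information systems", 50000),
     (2, "information systems", 70000),
     (3, "sales", 40000)],
)

# Filter after grouping: every department may be grouped first.
having_rows = conn.execute(
    "SELECT AVG(salary) AS avgsalary, dept FROM Employee "
    "GROUP BY dept HAVING dept = 'information systems'"
).fetchall()

# Filter before grouping: only matching rows reach the aggregation.
where_rows = conn.execute(
    "SELECT AVG(salary) AS avgsalary FROM Employee "
    "WHERE dept = 'information systems' GROUP BY dept"
).fetchall()

print(having_rows)  # [(60000.0, 'information systems')]
print(where_rows)   # [(60000.0,)]
```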

  44. Avoid Views with Unnecessary Joins
      • Join with Techdept unnecessarily
          CREATE VIEW Techlocation AS
          SELECT ssnum, Techdept.dept, location FROM Employee, Techdept
          WHERE Employee.dept = Techdept.dept

          SELECT dept FROM Techlocation WHERE ssnum = 4444
      • Query the base table directly:
          SELECT dept FROM Employee WHERE ssnum = 4444

  45. Aggregate Maintenance
      • Materialize an aggregate if needed "frequently"
      • Use trigger to update
          create trigger updateVendorOutstanding on orders for insert as
          update vendorOutstanding
          set amount =
            (select vendorOutstanding.amount + sum(inserted.quantity * item.price)
             from inserted, item
             where inserted.itemnum = item.itemnum)
          where vendor = (select vendor from inserted);

  46. Avoid External Loops
      • No loop:
          sqlStmt = "select * from lineitem where l_partkey <= 200;"
          odbc->prepareStmt(sqlStmt);
          odbc->execPrepared(sqlStmt);
      • Loop:
          sqlStmt = "select * from lineitem where l_partkey = ?;"
          odbc->prepareStmt(sqlStmt);
          for (int i = 1; i <= 200; i++) {
            odbc->bindParameter(1, SQL_INTEGER, i);
            odbc->execPrepared(sqlStmt);
          }
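
The contrast can be reproduced with a runnable sketch. Python's built-in `sqlite3` module stands in for the ODBC wrapper on the slide (an assumption; the slide's `odbc` object is not a real API reproduced here), and the `lineitem` contents are made up. The point is that the loop sends 200 statements across the application interface to fetch what one set-oriented statement returns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lineitem (l_orderkey INTEGER, l_partkey INTEGER)")
# 900 rows; every l_partkey value from 1 to 300 occurs three times.
conn.executemany("INSERT INTO lineitem VALUES (?, ?)",
                 [(k, k % 300 + 1) for k in range(900)])

# No loop: one statement, the DBMS handles the whole set.
set_rows = conn.execute(
    "SELECT * FROM lineitem WHERE l_partkey <= 200").fetchall()

# Loop: one parameterized statement executed 200 times by the application.
loop_rows = []
statements_sent = 0
for i in range(1, 201):
    loop_rows.extend(conn.execute(
        "SELECT * FROM lineitem WHERE l_partkey = ?", (i,)).fetchall())
    statements_sent += 1

print(len(set_rows), len(loop_rows), statements_sent)  # 600 600 200
```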

  47. Avoid External Loops
      • SQL Server 2000 on Windows 2000
      • Crossing the application interface has a significant impact on performance
      • Let the DBMS optimize set operations

  48. Avoid Cursors
      • No cursor:
          select * from employees;
      • Cursor:
          DECLARE d_cursor CURSOR FOR select * from employees;
          OPEN d_cursor
          while (@@FETCH_STATUS = 0)
          BEGIN
            FETCH NEXT from d_cursor
          END
          CLOSE d_cursor
          go

  49. Avoid Cursors
      • SQL Server 2000 on Windows 2000
      • Response time is a few seconds with a SQL query and more than an hour iterating over a cursor

  50. Retrieve Needed Columns Only
      • All:
          Select * from lineitem;
      • Covered subset:
          Select l_orderkey, l_partkey, l_suppkey, l_shipdate, l_commitdate from lineitem;
      • Avoid transferring unnecessary data
      • May enable use of a covering index
