1 / 60

Optimization and Evaluation of Nested Queries and Procedures

Optimization and Evaluation of Nested Queries and Procedures. Ravindra Guravannavar Advisor: Prof. S. Sudarshan Indian Institute of Technology Bombay. Broad Outline of this Talk. Motivation and problem overview Known approaches and limitations Overview of our approach

zohar
Download Presentation

Optimization and Evaluation of Nested Queries and Procedures

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Optimization and Evaluation of Nested Queries and Procedures Ravindra Guravannavar Advisor: Prof. S. Sudarshan Indian Institute of Technology Bombay

  2. Broad Outline of this Talk • Motivation and problem overview • Known approaches and limitations • Overview of our approach • Technical contributions and results • Related work • Possible directions for future work

  3. Motivation Database queries/updates are often invoked repeatedly with different parameter values in a single decision support task. Reason 1. Nested sub-queries (with correlated evaluation) SELECT o_orderkey, o_custkey FROM orders WHERE o_shipdate NOT IN ( SELECT l_shipdate FROM lineitem WHERE l_orderkey = o_orderkey );

  4. Motivation Database queries/updates are often invoked repeatedly with different parameter values. Database queries/updates are often invoked repeatedly with different parameter values in a single decision support task. Reason 2. Queries inside user-defined functions (UDFs) that are invoked from other queries SELECT * FROM category WHEREcount_items(category-id) > 50;

  5. Motivation Database queries/updates are often invoked repeatedly with different parameter values in a single decision support task. int count_items(int categoryId) { … … SELECT count(item-id) INTO icount FROM item WHERE category-id = :categoryId; … } Procedural logic with embedded queries

  6. Motivation Reason 3: Queries/updates inside stored-procedures, which are called repeatedly by external batch jobs. Reason 4: Queries/updates invoked from imperative loops in application programs or complex procedures. // A procedure to count items in a given category and sub-categories int count_subcat_items(int categoryId) { … while(…) { … SELECT count(item-id) INTO icount FROM item WHERE category-id = :curcat; … } }

  7. Problem Overview Naïve iterative execution of queries and updates is often inefficient Causes: • No sharing of work (e.g., disk IO) • Random IO and poor buffer effects • Network round-trip delays

  8. SELECT o_orderkey, o_custkey FROM orders WHERE o_shipdate NOT IN ( SELECT l_shipdate FROM lineitem WHERE l_orderkey = o_orderkey ); One random read for each outer tuple (assuming all internal pages of the index are cached), avg seek time 4ms. Primary Index on l_orderkey 4 3 2 1 Problem Overview ORDERS 1.5M tuples LINEITEM 6M tuples

  9. Known Approach: Decorrelation • Rewrite a nested query using set operations (such as join, semi-join and outer-join). [Kim82, Dayal87, Murali89, Seshadri96, Galindo01, ElHemali07] • Increases the number of alternative plans available to the query optimizer. • Makes it possible choose set-oriented plans (e.g., plans that use hash or sort-merge join) which are often more efficient than naïve iterative plans.

  10. Gaps in the Known Approach • Decorrelation does not speed up iterative plans; it only enables the use of alternative plans. • For some nested queries an iterative plan can be the best of the available alternatives [Graefe03] • Known decorrelation techniques are not applicable for: • complex nested blocks such as user-defined functions and procedures • repeated query executions from imperative program loops

  11. Our Approach A two-pronged approach • Improve the efficiency of iterative plans by exploiting sorted parameter bindings. • Earlier work [Graefe03], ElHemali[07] and Iyengar[08] consider improved iterative plans through asynchronous IO and caching. • Our work adds new techniques and considers optimization issues • Earlier work, not part of the VLDB08 paper or this talk. • Enable set-oriented execution of repeatedly called stored procedures and UDFs, by rewriting them to accept batches of parameter bindings. • Involves program analysis and rewrite.

  12. Optimizing Iterative Calls to Stored Procedures and UDFs // UDF to count items in a category int count_items(int categoryId) { … SELECT count(item-id) INTO :icount FROM item WHERE category-id = :categoryId; … } • SPs and UDFs are written using procedural extensions to SQL (e.g., PL/SQL) • Called iteratively from • Outer query block • Batch applications • Known decorrelation techniques do not apply (except in verysimple cases)

  13. Optimizing Iterative Calls to Stored Procedures and UDFs // UDF to count items in a category and all // the sub-categories int count_subcat_items(int categoryId) { … while(…) { … … SELECT count(item-id) INTO :icount FROM item WHERE category-id = :curcat; … } } • UDFs can involve complex control-flow and looping. • Our techniques can deal with: • Iterative calls to a UDF from an application. • Iterative execution of queries within a single UDF invocation.

  14. Problem Motivated by Real World Applications • National Informatics Centre (NIC) • Iterative updates from an external Java application • A financial enterprise • SQLJ stored procedures called iteratively • National Securities Depository Ltd. (NSDL) • SPs called repeatedly for EOD processing

  15. Optimizing Iterative Calls to Stored Procedures and UDFs Approach: Use Parameter Batching Repeated invocation of an operation is replaced by a single invocation of its batched form. Batched form: Processes a set of parameters • Enables set-oriented execution of queries/updates inside the procedure • Possible to use alternative plans • Repeated selection  join • Reduced random disk access • Efficient integrity checks and index updates • Reduced network round-trip delays

  16. Effect of Batch Size on Inserts Bulk Load: 1.3 min

  17. Iterative and Set-Oriented Updates on the Server Side • TPC-H PARTSUPP (800,000 records), Clustering index on (partkey, suppkey) • Iterative update of all the records using T-SQL script (each update has an index lookup plan) • Single commit at the end of all updates Takes 1 minute Same update processed as a merge (update … from …) Takes 15 seconds

  18. Batched Forms of Queries/Updates • Batched forms of queries/updates are known Example: SELECT item-id FROM item WHERE category-id=? Batched form: SELECT pb.category-id, item-id FROM param-batch pb LEFT OUTER JOIN item ON pb.category-id = item.category-id; • INSERT : INSERT INTO … SELECT FROM … • UPDATE : SQL MERGE

  19. SQL Merge GRANTMASTER GRANTLOAD merge into GRANTMASTER GM using GRANTLOAD GL on GM.empid=GL.empidwhen matched then update set GM.grants=GL.grants; Notation:Mc1=c1’,c2=c2’,… cn=cn’(r, s)

  20. Batched Forms of Procedures // Batched form of a simple UDF Table count_items_batched( Table params(categoryId) ) { RETURN ( SELECT params.categoryId, item.count(item-id) FROM params LEFT OUTER JOIN item ON params.categoryId = item.categoryId; ); } • Possible to write manually: Time-consuming and error-prone • Our aim: Automate the generation of batched forms

  21. Challenges in Generating Batched Forms of Procedures • Must deal with control-flow, looping and variable assignments • Inter-statement dependencies may not permit batching of desired operations • Presence of “non batch-safe” (order sensitive) operations along with queries to batch Approach: • Equivalence rules for program transformation • Make use of the data dependence graph

  22. Batch Safe Operations • Batched forms – no guaranteed order of parameter processing • Can be a problem for operations having side-effects Batch-Safe operations • All operations without side effects • Also a few operations with side effects • E.g.: INSERT on a table with no constraints • Operations inside unordered loops (e.g., cursor loops with no order-by)

  23. Generating Batched Form of a Procedure Step 1: Create trivial batched form. Transform: procedure p(r) { body of p} To procedure p_batched(pb) { for each record r in pb { < body of p> < collect the return value > } return the collected results paired with corrsp. params; } Step 2: Move the query exec stmts out of the loop and replace them with a call to their batched form.

  24. for each t in r loop insert into orders values (t.order-key, t.order-date,…); end loop; insert into orders select … from r; Rule 1A: Rewriting a Simple Set Iteration Loop where q is any batch-safe operation with qb as its batched form

  25. Rule 1B: Assignment of Scalar Result for each r in t { t.count = select count(itemid) from item where category=t.catid; } merge into t usingbqon t.catid=bq.catid when matched then update set t.count=bq.item-count; bq: select t.catid, count(itemid) as item-count from t left outer join item on item.category=t.catid; * SQL Merge: Lets merging the result of a query into an existing relation

  26. Rewriting Simple Set Iteration Loops Rule 1B: Unconditional invocation with return value SQL Merge

  27. Condition for Invocation Return values Operation to Batch Let b= // Parameter batch Let // Merge the results* Rule 1C: Batching Conditional Statements * SQL Merge: Lets merging the results of a query into an existing relation

  28. Table(T) t; while(p) { ss1 modified to save local variables as a tuple in t } Collect the parameters for each r in t { sq modified to use attributes of r; } Can apply Rule 1A-1C and batch. for each r in t { ss2 modified to use attributes of r; } Process the results Rule 2: Splitting a Loop while (p) { ss1; sq; ss2; } * Conditions Apply

  29. Loop Splitting: An Example while(top > 0) { catid = stack[--top]; count = select count(itemid) from item where category=:catid; total += count; } Table(key, catid, count) t; int loopkey = 0; while(top > 0) { Record r; catid = stack[--top]; r.catid = catid; r.key = loopkey++; t.addRecord(r) } for each r in t order by key { t.count = select count(itemid) from item where category=t.catid; } for each r in t order by key { total += t.count; } // order-by is removed if operation is batch-safe

  30. Rule 2: Pre-conditions • Conditions are on the data dependence graph (DDG) • Nodes: program statements • Edges: data dependencies between statements • Types of dependence edges • Flow (Write  Read), Anti (Read  Write) and Output (Write  Write) • Loop-carried flow/anti/output: Dependencies across loop iterations • Pre-conditions for Rule-2 (Loop splitting) • No loop-carried flow dependencies cross the points at which the loop is split • No loop-carried dependencies through external data (e.g., DB) Formal specification of Rule-2 in paper

  31. Dependency Analysis

  32. More Notation

  33. Yet More Notation

  34. Rule 2

  35. Is rewritten to

  36. Rule 2a contd.

  37. Rule 2a contd.

  38. Cycles of Flow Dependencies

  39. Rule 3: Separating Batch Safe Expressions

  40. Rule 4: Control Dependencies while (…) { item = …; qty = …; brcode = …; if (brcode == 58) brcode = 1; insert into … values (item, qty, brcode); } Store the branching decision in a boolean variable while (…) { item = …; qty = …; brcode = …; boolean cv = (brcode == 58); cv? brcode = 1; cv? insert into … values (item, qty, brcode); }

  41. Cascading of Rules After applying Rule 2 (loop splitting) Table(…) t; while (…) { r.item = …; r.qty = …; r.brcode = …; r.cv = (r.brcode == 58); r.cv=true? r.brcode = 1; t.addRecord(r); } for each r in t { r.cv=true? insert into … values (r.item, r.qty, r.brcode); } insert into .. ( select item, qty, brcode from t where cv=true ); Rule 1C deals with batching conditional statements. Formal spec in the Thesis.

  42. Data Dependencies Flow Dependence Anti Dependence Output Dependence Control Dependence Loop-Carried Need for Reordering Statements (s1)while (category != null) { (s2)item-count = q1(category); (s3)sum = sum + item-count; (s4)category = getParent(category); }

  43. Reordering Statements to Enable Rule 2 while (category != null) { int item-count = q1(category); // Query to batch sum = sum + item-count; category = getParent(category); } Splitting made possible after reordering while (category != null) { int temp = category; category = getParent(category); int item-count = q1(temp); sum = sum + item-count; }

  44. Rule 5a and 5b

  45. Rule 5a and 5b

  46. Statement Reordering X FD+ X FD+ FD+ X X FD+ Case-3B Case-3A Case-2 Case-1 X Cannot exist

  47. Statement Reordering Algorithm reorder: Reorders statements in a loop to enable loop splitting. (Details in thesis.) Theorem: (Not in VLDB08 paper) If a query execution statement does not lie on a true- dependence cycle in the DDG, then algorithm reorder always reorders the statements such that the query execution can be batched. True-dependence cycle: A directed cycle made up of only FD and LCFD edges.

  48. Batching Across Nested Loops • Statement to batch can be present inside nested loops • Rules 6A, 6B handle nested loops • Batch the stmt w.r.t. the inner loop and then w.r.t. the outer loop • Batches as nested tables: use nest and unnest

  49. Batching Across Multiple Levels while(…) { …. while(…) { ... q(v1, v2, … vn); … } … } while(…) { …. Table t (…); while(…) { ... } qb(t); … } Batch q w.r.t inner loop

  50. Parameter Batches as Nested Tables

More Related