Automatic Physical Design Tuning: Workload as a Sequence

Automatic Physical Design Tuning:Workload as a Sequence Sanjay Agrawal, Microsoft Research Eric Chu, University of Wisconsin-Madison Vivek Narasayya, Microsoft Research

Automatic Physical Design Tuning • DB applications more complex and varied. • Considerable time spent on tuning. • Reduce cost of ownership of RDBMS. • Automatically recommend physical design. • Supported by DB vendors. • Database Engine Tuning Advisor, Microsoft • Design Advisor, IBM • SQL Access Advisor, Oracle SIGMOD 2006

Microsoft Database Engine Tuning Advisor Set of queries, updates Applications Workload Query Optimizer (extended) Database Engine Tuning Advisor “What-if” Set of indexes, materialized views, horizontal partitions Microsoft SQL Server 2005 Recommendation SIGMOD 2006

Create Indexes Drop Indexes Create Indexes Queries Day Updates Night Queries Day Workload as a Sequence: Motivation • Data warehousing • Query by day, update at night. • Set: No index recommended when update costs outweigh benefits. • Sequence: May exploit benefits of indexes without incurring update costs. • Insert “create” and “drop” of indexes to workload. • Exploit order of statements. SIGMOD 2006

Create Indexes Drop Indexes Create Indexes Set VS Sequence • Set-based • Recommendation is robust to changes in order of statement arrival. • Can miss good recommendations compared to sequenced-based approach. • Outputs are different • Set: what indexes to create or drop? • Sequence: what indexes to create or drop and where? Queries Updates Queries SIGMOD 2006

Model Workload as a Sequence • Motivation • Problem Definition • Optimal Algorithm • Disjoint Sequences • Greedy-SEQ • Experiments SIGMOD 2006

C0 C1 C2 C3 CN CN+1 S1 S2 SN S3 Problem Setting Workload: S = [S1, S2, …, SN] • Cost(Si,Ci) – cost of executing Si with Ci. • TC(C1, C2) – transition cost • Sequence execution cost • Nk=1((Cost(Sk,Ck) + TC(Ck-1,Ck)) + TC (CN,CN+1) Si{Select, Insert, Delete, Update} SIGMOD 2006

Problem Definition • Given: • Database D, workload W = [S1, …, SN], initial configuration C0, and storage bound M. • Find configurations C1, C2, …, CN+1 such that • Minimize sequence execution cost: Nk=1((Cost(Sk,Ck) + TC(Ck-1,Ck)) + TC (CN,CN+1) • Storage of Ci≤ M, for all i. SIGMOD 2006

Search Space • Given N statements and M indexes • Sequence-based tuning • 2M distinct configurations for each statement. • 2M(N+1) possible execution sequences. • Set-based tuning • 2M configurations. SIGMOD 2006

Model Workload as a Sequence • Motivation • Problem Definition • Optimal Algorithm • Disjoint Sequences • Greedy Heuristic • Experiments SIGMOD 2006

S1 S2 SN 0 { } { } { } SOURCE 0 0 DESTINATION Ic { } { } Id Ic Id {I} 0 {I} {I} Optimal Algorithm for Single-Index Case • Node costs: Cost(Si, { }) and Cost(Si,{I}). • Edge costs: 0, IC, and ID. • Cost of shortest path includes node and edge costs. DAG for single index, N statements SIGMOD 2006

General Case – Multiple Indexes S1 S2 SN EXHAUSTIVE • At each stage, enumerate all possible configurations from the set of indexes. • Algorithm linear in the number of nodes and edges of DAG. • However, number of nodes in DAG is exponential in the number of indexes. • M indexes => O(N*2M) nodes and O(N*2M) edges. CF1 CF2 CFN C0 Ci1 Ci2 CiN CN+1 C11 C12 C1N C01 C02 C0N SIGMOD 2006

Sequence, Constraints Candidateset of structures Solve sequence using EXHAUSTIVE Recommendation Optimal Solution SIGMOD 2006

Search-Space Pruning Techniques to reduce number of nodes: • Cost-based Pruning • Leverages shortest-path solutions of individual indexes. • Prunes configurations at each stage without loss of optimality. • Disjoint Sequences • Divide-and-conquer approach. • Splits the input sequence and candidate index set. • Greedy-SEQ • Guarantees a polynomial number of nodes. SIGMOD 2006

Exploiting Disjoint Sequences • Two sequences X and Y are disjoint if they do not share any statements AND indexes. • Disjoint sequences are common • E.g., server hosts multiple applications that touch different databases. • Approach: • Split workload into disjoint sequences. • Solve each sequence independently. • Merge to get final solution. • Idea: DAG for each disjoint sequence has fewer nodes. SIGMOD 2006

W {I1,I2,I3} W1 {I1} S3 S4 S5 S6 S1 S2 S7 S3 S4 S1 W2 {I2} S5 S6 S2 W3 {I3} S7 Efficiency Gain with Disjoint Sequences 8 nodes at each stage 2 nodes at each stage for each sequence SIGMOD 2006

SRC S1 S3 S4 DEST I1c I1d { } {I1} {I1} { } { } W1 = [S1,S3,S4] S2 S5 S6 DEST SRC I2c I2d { } { } {I2} {I2} { } S7 DEST SRC I3c {I3} {I3} W2 = [S2,S5,S6] { } W3 = [S7] S2 S3 S4 S5 S6 S7 DEST SRC S1 {I1,I2} {I1,I2} {I2} { } {I2} { } {I3} {I3} {I1} Pu is optimal when there are no storage violations. Merge solutions of W1, W2, and W3: No storage violations SIGMOD 2006

{ } Pu is not a valid solution as it has configurations with storage violation. DEST S4 S5 S6 S7 DEST SRC S1 S2 S3 SRC S1 S2 S3 S4 S5 S6 S7 {I1} {I1,I2} { } {I1} {I2} {I2} {I2} { } {I3} {I3} {I2} {I1,I2} {I2} { } {I3} {I3} {I1} Merge in the presence of storage violation • Suppose storage bound allows only 1 index. Pu’= Merge P1, P2 and P3 to get a valid solution. • Note that cost of Pu is a lower bound on cost of any valid solution. SIGMOD 2006

Apply Split operator to get disjoint sequences Solve each sequence independently using EXHAUSTIVE Merge results of disjoint sequences Recommendation Solution with Split and Merge Sequence, Constraints Candidateset of structures or GREEDY-SEQ SIGMOD 2006

Greedy Approach • Goal: • Explore a polynomial number of good configurations. • Run shortest path over the DAG constructed with these configurations. • Solution close to optimal. • Greedy-SEQ: adaptation of existing greedy technique for the sequence model. SIGMOD 2006

Greedy-SEQ • Steps of Greedy-SEQ: • Get optimal solution for each index. Record configurations. • Initialize current best to be the lowest-cost solution seen so far. • Improve current best by combining with other solutions and resetting current best. Record new configurations of current best. • Repeat until no more improvement. • Run shortest-path over configurations collected. SIGMOD 2006

S0 S1 S2 SK SL SN SN+1 I1 {I1} {I1} {} {} {} {I1} {} I2 {} {I2} {I2} {I2} {} {} {} {I1} {I1} {I1} {} {} {I2} I1,I2 {I2} {I2} {} {} {} {I1,I2} {I1,I2} Combining Two Single-Index Solutions SIGMOD 2006

S0 S1 S2 SK SL SN SN+1 I1 {I1} {I1} {} {} {} {I1} {} I2 {} {I2} {I2} {I2} {} {} {} {I1} {I1} {I1} {} {} {I2} {I2} {} {} {} {I1,I2} {I1,I2} Combining Two Single-Index Solutions {I2} I1,I2 SIGMOD 2006

Greedy-SEQ: Greedy Approach • Get optimal solution for each index. Record configurations. • Initialize current best to be the lowest-cost solution seen so far. • Improve current best by combining with other solutions and resetting current best. Record new configurations of current best. • Repeat Step 3 until no more improvement. • Run shortest-path over configurations collected. SIGMOD 2006

Sequence, Constraints Candidateset of structures Apply split operator to get disjoint sequences Apply cost-based pruning oneach sequence Solve each sequence independently using EXHAUSTIVE or GREEDY-SEQ Merge results of disjoint sequences Recommendation End-to-End Solution SIGMOD 2006

Sequence VS Set-based approaches • % improvement relative to the optimal set-based solution. • Sequence is better in the presence of updates and/or storage bound is low. SIGMOD 2006

Greedy-SEQ VS Exhaustive • Greedy-SEQ’s much faster with minimal degradation in quality. SIGMOD 2006

Effectiveness of Split and Merge • With split and merge (SPMR) VS without (WO-SPMR) SIGMOD 2006

Conclusion • Sequence model allows more optimization opportunities than set model. • Model the problem as finding the shortest path over a DAG. • Heuristics give nearly optimal solutions with much better performance. SIGMOD 2006

Automatic Physical Design Tuning: Workload as a Sequence