CS 440 Database Management Systems

CS 440 Database Management Systems Query Optimization

Today’s lecture Past lectures DBMS Architecture User/Web Forms/Applications/DBA query transaction Query Parser Transaction Manager Query Rewriter Logging & Recovery Query Optimizer Lock Manager Query Executor Files & Access Methods Lock Tables Buffers Buffer Manager Main Memory Storage Manager Storage

Query Rewriting/Optimization • Parse SQL into a logical query plan. • More than one logical plan! • Pick the fastest -> optimization • Convert the logical query plan to a physical query plan. • More than one physical plan! • Pick the fastest -> optimization

From SQL Query to Logical Query Plan

Logical Query Plan • A tree whose nodes are relational algebra operators. PA1,…,An(sC (R1 ∞ R2 ∞ … ∞ Rk)) SELECTA1, …, An FROMR1, …, Rk WHEREC A1,…,An sC Rk … R1

Logical Query Plan • A tree whose nodes are extended relational algebra operators. A1,…,An SELECT A1, …, An, Func FROMR1, …, Rk WHEREC GROUP BY Ai, …, Aj PA1,…,An(gAi, …, Aj, Func(sC (R1 ∞ … ∞ Rk))) gAi, …, Aj, Func sC Rk … R1

SQL -> Logical Query Plan • Straightforward for simple SQL queries. • Map select-from-where to projection-join-selection • Subquery in where clause • Remove it!

Example: Removing Subqueries StarsIn(title, year, starName) Star(name, address, gender, birthdate) • More examples in 16.3.2 SELECT title FROMStarsIn WHEREstarNameIN ( SELECT name FROM Star WHERE birthdate LIKE ‘%1960’) SELECT title FROMStarsIn, Star WHEREstarName = name AND birthdate LIKE ‘%1960’

Query Rewriting

Query Rewriting • Different logical plans with the same meanings • Same outputs for all possible databases Star(name, address, gender, birthdate) sgender=‘male’ OR name=‘waltz’ U sgender=‘male’ sname = ‘waltz’ Star Generally Faster Star Star

Algebraic Laws • The ways that we can rewrite a logical plan. • Associative and communicative laws R ∞ S = S ∞R, R ∞(S ∞T) = (R ∞S) ∞ T R U S = S U R, R U (S U T) = (R U S) U T R S = S R, R (S T) = (R S) T • Distributive laws R ∞ (S U T) = (R ∞ S) U (R ∞ T) U U U U U U

Algebraic Laws for Selection U • C AND C’(R) = sC(sC’(R)) = sC(R) sC’(R) • C OR C’(R) = sC(R) U sC’(R) • C (R ∞ S) = sC (R) ∞ S • C includes only attributes of R • What if it includes attributes of R and S? sC (R – S) = sC (R) – S sC (R US) = sC (R) UsC (S) sC (R S) = sC (R) S U U

Example: Algebraic Laws for Selection • R(A, B, C, D), S(E, F) • sF=1 (R ∞ A=E S) = ? • sB=2 AND F=3 (R ∞ A=E S) = ?

Algebraic Laws for Projection U PM(PN(R)) = PM N(R) PM(R ∞ S) = PN(PP(R) ∞PQ(S)) • N, P, Q are appropriate subsets of M • Example R(A,B,C,D), S(E, F) • PB,C,F(R ∞A=E S) = P ? (P?(R) ∞P?(S))

Push Selection Down Uses sC (R ∞ S) = sC (R) ∞ S If the selection is executed, fewer tuples will be processes afterwards. It may cause to lose desired ordering, if we use indexes title title sgender=‘male’ AND year=1950 starName=name starName=name gender=‘male’ year=1950 Star StarsIn Star StarsIn

Push Selection Up and Down StarsIn(title, year, starName) Movie(title, year, genre, producer) starName Movie.title=StarsIn.title AND Movie.year=StarsIn.year StarsIn year=1940 Movie

Push Selection Up and Down StarsIn(title, year, starName) Movie(title, year, genre, producer) starName starName syear =1940 title=title AND year=year StarsIn year=1940 Movie.title=StarsIn.title AND Movie.year = StarsIn.year Movie StarsIn Movie

Push Selection Up and Down StarsIn(title, year, starName) Movie(title, year, genre, producer) starName starName starName syear =1940 title=title AND year=year Movie.title=StarsIn.title AND Movie.year=StarsIn.year StarsIn year=1940 title=title AND year=year year=1940 year=1940 Movie StarsIn Movie Movie StarsIn

Push Projection Down StarsIn(title, year, starName) Movie(title, year, genre, producer) title title StarsIn.title=Movie.title StarsIn.title=Movie.title title title Movie StarsIn Movie StarsIn We keep the notation at the top. Less effective than pushing down selection.

Push Projection Down It is not always possible. title sstarName=‘Waltz’ AND producer=‘Tarantino’ AND year=2012 StarsIn.title=Movie.title Movie StarsIn Be careful when there are set operations or aggregation functions in the tree!

Push Duplicate Elimination Down δ syear =1940 syear =1940 StarsIn.title=Movie.title StarsIn.title=Movie.title δ δ Movie StarsIn Movie StarsIn It is not always possible, e.g. aggregation functions. Read the book for algebraic laws of duplicate elimination.

Remove Unnecessary Duplicate Elimination δ syear > 1940 Uset title,year ttle,year StarsIn Movie

Remove Unnecessary Duplicate Elimination syear > 1940 δ δ syear > 1940 Uset Uset title,year ttle,year title,year ttle,year StarsIn Movie StarsIn Movie

Remove Unnecessary Duplicate Elimination syear > 1940 δ δ syear > 1940 syear > 1940 Uset Uset Uset title,year ttle,year title,year ttle,year title,year ttle,year StarsIn Movie StarsIn Movie StarsIn Movie

Query Optimization • 16.5

Optimization Strategies • Optimal approach • Enumerate each possible plan • Measure its performance by running it • Pick the fastest one • Heuristics approach • Use some fixed heuristics • e.g. always nested loop joins • e.g. order relations from smallest to largest • Selinger/ System R optimization

Cost-based Optimization • Plan Space • What is the space of query plans? • Cost estimation • How to estimate cost of each plan, without executing it? • Search Algorithms • How to search the space, as guided by cost estimates

Query Plans • Order of the associative and communicative operators • Join, union, …. • The algorithms for operators • Hash-join, merge-join, … • Implicit operators • Full-scan, index-scan, … • Communication between operators • Pipelining versus materializing • “Interesting order”

Join Order • Many join algorithms are asymmetric • Nested loop • S: outer relation, R: inner relation • S’s tuples are examined against R’s • Index-based join • S has index • S’s tuples are examined against R’s • Left relation: build relation • Right relation: probe relation

Join Order Many possible query plans (join trees) for R ∞ S ∞T ∞ U: BushyLeft deep U T R S T U R S R Right deep S T U

Join Ordering Problem • Given: R1 ⋈ R2 ⋈ … ⋈ Rn • We have a cost function that computes the cost of every join tree. • Find the most efficient join tree for the query. • Dynamic programming: try to optimize partial plans instead of the complete plan.

Dynamic Programming • Find the best plan for each subset of {R1, …, Rn} • Use the information to compute the best plan for larger subsets • Step 1: best plans for {R1}, {R2}, …, {Rn} • Step 2: best plans for {R1,R2}, {R1,R3}, …, {Rn-1, Rn} • … • Step n: best plan for {R1, …, Rn} • A bottom-up strategy

Dynamic Programming • For each subset S ⊆ {R1, …, Rn} calculate: • The estimated size of S: Size(S) • Plan(S): A best plan for S • Cost(S): The cost of Plan(S) • The cost of a plan is the size of intermediate relations in the plan. • We can use other types of costs.

Dynamic Programming • Step 1: For each {Ri}: • Size({Ri}) = B(Ri) • Plan({Ri}) = Ri • Cost({Ri}) = 0. • Step 2: For each {Ri, Rj}: • Size({Ri,Rj}) = estimate of the size of join (more later) • Plan({Ri,Rj}) = Ri∞ Rjor Rj∞Ri • Cost = 0 (no intermediate relations)

Dynamic Programming • Step i: For each S ⊆ {R1, …, Rn} of cardinality i do: • Compute Size(S) • For every S1 ,S2s.t. S = S1S2C = cost(S1) + cost(S2) + size(S1) + size(S2) • If Si is a base table, ignore its size. • Cost(S) = the smallest C • Plan(S) = the corresponding plan • Return Plan({R1, …, Rn})

Dynamic Programming: Example • Cost(R ⋈S) = 0 (no intermediate results) • Cost((R ⋈S) ⋈T) = Cost(R ⋈S) + Cost(T) + size(R ⋈S) [+ Size(T)] = size(R ⋈S)

Dynamic Programming: Example • Relations: R, S, T, U • Number of tuples: 2000, 5000, 3000, 1000 • A toy size estimation method: • Size (A ⋈ B) = 0.01*T(A)*T(B)

Dynamic Programming: Reducing Search Space • Exponential computation! • Consider only left-deep join trees • Ignore join trees with Cartesian product • The size of a Cartesian product is generally much larger than (natural) joins. • Example: R(A,B), S(B,C), U(C,D)(R ∞ U) ∞S has a Cartesian product pick (R ∞ S) ∞ U instead

Cost Estimation • Estimating the cost of a physical plan • The cost of a physical operators depend mainly on the size of its inputs • Sort-merge join for R ∞ S : 3 B(R) + 3 B(S) • In many queries the inputs of operators are intermediate results • Sort-merge join for (R ∞ S) ∞ T • 3 B(T) + 3 B(R ∞ S) + 3 B(R) + 3 B(S)

Size Estimation: Base Tables • Cost factors in query execution • B(R): Number of blocks in R • T(R): Number of tuples in R • V(R,A): Number of distinct values of attribute A in R

Size Estimation: Base Tables • B(R) = T(R ) / block size • T(R): scan the table • Update T(R) • Whenever DBA asks • Periodically

Size Estimation: Base Tables • V(R,A): too much information • Use histogram, …

Size Estimation: Intermediate Results • Estimating the size of a projection • Projection does not eliminate duplicates: T(PA(R)) = T(R)

Size Estimation: Selection • Equality: E = sA=a(R) • T(E) ranges from 0 to T(R) – V(R,A) + 1 • Consider its mean: T(E) = T(R)/V(R,A) • Inequality: E = sA<a(R) • T(E) ranges from 0 to T(R) • Rule of thumb: T(E) = T(R)/3 • Not equal: E = sA<>a(R) • Rule of thumb: T(E) = T(R)

Size Estimation: Selection • ANDin the condition: • R(A,B), E = sA=1 AND B<10(R) • Multiply: • 1/V(R,A) for equality, • 1/3 for inequality • 1 for not equal • T(R) = 10,000, V(R,A) = 50, T(E) = ? • Similar method for OR: • Read example 16.22

A A A Size Estimation: Join • Natural join, R S • Simple cases • The set of A values in R and S are disjoint: T(R S) = 0 • A is a key in S and a foreign key in R: T(R S) = T(R)

CS 440 Database Management Systems