Chapter 14 Query Optimization

Chapter 14 Query Optimization

Chapter 14: Query Optimization • Introduction • Statistical (Catalog) Information for Cost Estimation • Estimation of Statistics • Cost-based optimization • Revisiting Selection Algorithms • Transformation of Relational Expressions and Equivalence rules • Dynamic Programming for Choosing Evaluation Plans • Optimizing nested subqueries • Materialized views and view maintenance

Introduction • There exist many alternative ways of evaluating a given query: • Different algorithms for each operation (Chapter 13) • Equivalent relational algebraic expressions

Introduction, Cont. • Relations generated by two equivalent expressions have the same set of attributes and contain the same set of tuples, although their attributes may be ordered differently.

Introduction, Cont. • An evaluation plan (also known as a query plan, or query execution plan) defines exactly what algorithm is used for each operation, and how the execution of the operations is coordinated.

Introduction, Cont. • Cost difference between different plans can be enormous: • Example: performing a r X s followed by a selection r.A = s.B is much slower than performing a join on the same condition. • Optimizer frequently estimates the cost of operations: • Depends critically on statistical information that the database must maintain, e.g. number of tuples, number of distinct values for join attributes, etc. • Similar statistics must be estimated for intermediate results as well. • Statistics must be accurate and up-to-date: • 17 hours vs. 1/2 hour for 100k row, two-table join.

Introduction, Cont. • Choosing the cheapest algorithm for each operation independently may not yield best overall algorithm: • Merge-join may be costlier than hash-join, but may provide a sorted output which reduces the cost for an outer level aggregation. • Nested-loop join may be costlier than most other algorithms, but may provide opportunity for pipelining.

Cost-Based Optimization • Cost-based optimization - a query plan is developed as follows: • Generate logically equivalent expressions using equivalence rules. • Annotating resultant expressions to get alternative query plans. • Choosing the cheapest plan based on estimated cost.

Rule-Based (Heuristic) Optimization • Rule-based optimization - a query plan is developed by applying rules, or heuristics, that should reduce query cost (no cost estimate is made). • Relative to rule-based optimization, cost-based optimization is expensive, but worthwhile for queries on large datasets. • Most real query optimizers incorporate elements of both approaches.

Statistical Informationfor Cost Estimation • nr: number of tuples in a relation r. • br: number of blocks containing tuples of r. • sr: size of a tuple of r. • fr: blocking factor of r — i.e., the number of tuples of r that fit into one block. • If tuples of r are stored together physically in a file, then:

Statistical Informationfor Cost Estimation • More generally:

Statistical Informationfor Cost Estimation, Cont. • V(A, r): number of distinct values that appear in r for attribute A; same as the size of A(r). • SC(A, r): selection cardinality for attribute A of relation r; average number of records that satisfy equality on A. • SC(A,r) = nr / V(A,r) • SC(A, r)/fr — number of blocks that these records will occupy if the relation is sorted on attribute A. • min(A,r): minimum value on attribute A in relation r. • max(A,r): maximum value on attribute A in relation r. • More generally, a complete histogram can be stored for each attribute of a relation, e.g., SC(v, A, r)

Statistical Information about Indices • fi: average fan-out of internal nodes of index i, for tree-structured indices such as B+ trees (note: overloaded notation!) • HTi: number of levels in index i — i.e., the height of i. • For a balanced tree index (such as B+ tree) on attribute A of relation r, HTi = logfi(V(A,r)). • For a hash index, HTiis 1 (or 2) • LBi: number of lowest-level B+ tree index blocks in i — i.e, the number of blocks at the leaf level of the index.

Query Property Estimates • In addition to the cost of specific algorithms, the following properties of the result of a query will be estimated: • Result size, primarily in terms of the number of tuples. • The number of distinct values for a specific attribute, i.e., V(A, r). • These estimates are independent of the algorithm used for an operation. • Nonetheless, they are frequently used to evaluate the cost of an algorithm. • Particularly helpful when the result from a sub-query is provided as input to another query. • In contrast to block I/Os, these are one of the topics of this chapter.

Selection OperationAlgorithms Revisited • Algorithm A1 (linear search). Scan each file block and test all records to see whether they satisfy the selection condition. • Cost estimate (number of disk blocks scanned) = br • If selection is on a key attribute, cost = (br /2) • A2 (binary search). Applicable if selection is an equality comparison on the attribute on which file is ordered. • Assume that the blocks of a relation are stored contiguously • Cost estimate becomes (number of disk blocks to be scanned): • Equality condition on a key attribute: SC(A,r) = 1

Selection Using Indices Revisited • Recall that HTi = logfi(V(A,r)) for a B+ tree, or HTi = 1 for a hash index. • A3 (primary index on candidate key, equality). Retrieve a single record that satisfies the corresponding equality condition • Cost = HTi+ 1 • A4 (primary index on nonkey, equality) Retrieve multiple records. • Records will be on consecutive blocks. • Cost = HTi+ number of blocks containing retrieved records.

Selection Using Indices Revisited • A5 (equality on search-key of secondary index). • Retrieve a single record if the search-key is a candidate key • Cost = HTi+ 1 • Retrieve multiple records if search-key is not a candidate key • Cost = HTi+ SC(A,r)

Selections InvolvingComparisons Revisited • Selections of the form AV(r) • Let c denote the estimated number of tuples satisfying the condition. Then: • c = 0, if v < min(A,r) • c = otherwise • In absence of statistical information c is assumed to benr / 2. • The case of A  V(r) is symmetric - a similar analysis applies.

Selections InvolvingComparisons Revisited • Can implement selections of the form AV (r) or A  V(r) by using • a linear or binary search, or • by using indices in the following ways: • A6 (primary index, comparison). (Relation is sorted on A) • For AV (r) just scan relation sequentially till first tuple > v; do not use index • For A  V(r) use index to find first tuple  v and scan relation sequentially from there. Cost estimate is: • As noted before, in the absence of statistical information, c is assume to be nr/2 in which case:

Selections InvolvingComparisons Revisited • A7 (secondary index, comparison). • For A  V(r) use index to find first index entry  v and scan index sequentially from there, to find pointers to records. • In either case, retrieve records that are pointed to requires an I/O for each record. Cost estimate is: • As noted before, in the absence of statistical information, c is assume to be nr/2 in which case: • Linear file scan may be cheaper if many records are to be fetched! • For AV (r) just scan leaf pages of index finding pointers to records, till first entry > v; estimate is similar.

Complex Selections Revisited • Cost estimates for complex selections follow directly from the estimates for A1 through A7 • Conjunction: 1 2. . . n(r) • A8 (conjunctive selection using one index). • A9 (conjunctive selection using multiple-key index). • A10 (conjunctive selection by intersection of identifiers). • Disjunction:1 2 . . . n(r). • A11 (disjunctive selection by union of identifiers). • Negation: (r)

Implementation of Complex Selections • The selectivityof a condition i is the probability that a tuple in the relation r satisfies i . • If si is the number of satisfying tuples in r, the selectivity of i is si /nr. • Conjunction: 1 2. . .  n (r). The estimate fornumberoftuples in theresult is: • Disjunction:12. . . n (r). Estimated number of tuples: • Negation: (r). Estimated number of tuples:nr–size((r))

Join Operation: Running Example Running example: depositor customer Catalog information for join examples: • ncustomer = 10,000 • fcustomer = 25, which implies that bcustomer=10000/25 = 400 • ndepositor = 5000 • fdepositor= 50, which implies that bdepositor=5000/50 = 100 • V(customer-name, depositor) = 2500, which implies that , on average, each customer has two accounts. Also assume that customer-name in depositor is a foreign key on customer.

Estimation of the Size of Joins • Let R and S be the sets of attributes for relations r and s, respectively. • The size estimate for the natural join of r and s depends on the common attributes. • Question: Why not just use |r| * |s|? • This would certainly be a valid upper-bound. • If R  S = , then rs is the same as r x s. • The Cartesian product r x s contains nr*nstuples; each tuple occupies sr + ssbytes.

Estimation of the Size of Joins • If R  S is a key for R, then a tuple of s will join with at most one tuple from r. • Therefore, the number of tuples in r s is no greater than the number of tuples in s • Question: Could it be greater than the number of tuples in r? • If R  S is a foreign key in S referencing R, then the number of tuples in rs is exactly the same as the number of tuples in s. • The case for R  S being a foreign key referencing S is symmetric. • In the example query depositor customer, customer-name in depositor is a foreign key of customer • Hence, the result has exactly ndepositor tuples, which is 5000

Estimation of the Size of Joins (Cont.) • If R  S = {A} is not a key for R or S. If we assume that every tuple t in R produces tuples in R S, the number of tuples in RS is estimated to be:If the reverse is true, the estimate is: • Conjecture: size of r s • In other words, (according to the book) the lower of these two estimates is probably the more accurate one (V(A,r) is probably not equal to V(A,s)). • This is probably true since A is neither a key nor a foreign key, and so neither r nor s is likely to have every tuple included in the result. • Question: Since we typically do a worst-case analysis, shouldn’t the larger of the two be used?

Estimation of the Size of Joins (Cont.) • Compute the size estimates for depositor customer without using information about foreign keys: • V(customer-name, depositor) = 2500, andV(customer-name, customer) = 10000 • The two estimates are 5000 * 10000/2500 = 20,000 and 5000 * 10000/10000 = 5000 • We choose the lower estimate, which in this case, is the same as our earlier computation using foreign keys.

Size Estimation for Other Operations • Projection: estimated size of A(r) = V(A,r) • Aggregation : estimated size of AgF(r) = V(A,r) • Set operations • For unions/intersections of selections on the same relation rewrite and use size estimate for selections: • 1 (r)  2 (r) can be rewritten as 12(r). • 1 (r)  2 (r) can be rewritten as 1(2 (r)). • For operations on different relations: • estimated size of r  s = size of r + size of s. • estimated size of r  s = min(size of r, size of s). • estimated size of r – s = r.

Size Estimation (Cont.) • Outer join: • Estimated size of r s = size of r s + size of r • Case of right outer join is symmetric • Estimated size of r s = size of r s + size of r + size of s • As in the previous case, these establish upper bounds. • Note that many of these estimates may be quite inaccurate, but typically provide upper bounds on result sizes. • The slides contain at the end of the chapter contain estimates for the number of distinct values produced by various operations. • We will skip these, and you are not responsible for them.

Transformation of Relational Expressions • Two relational algebra expressions are said to be equivalent if the two expressions generate the same set of tuples in every legal database instance. • Note that the order of tuples is considered irrelevant. • Question: what about sorting? • An equivalence rule asserts that two expressions are equivalent.

Equivalence Rules 1. Conjunctive selection operations can be deconstructed into a sequence of individual selections. • Selection operations are commutative. • Only the last in a sequence of projection operations is needed, the others can be omitted.

Equivalence Rules (Cont.) • Selections can be combined with Cartesian products and theta joins. •  (E1X E2) = E1 E2 • 1(E12 E2) = E11 2E2 • Theta-join operations (and natural joins) are commutative.E1  E2 = E2 E1 • (a) Natural join operations are associative: (E1 E2) E3 = E1 (E2 E3)(b) Theta joins are associative in the following manner:(E1 1 E2) 23E3 = E1 13 (E22 E3) where 2involves attributes from only E2 and E3****

Equivalence Rules (Cont.) 7. The selection operation distributes over the theta join operation under the following two conditions:(a) When all the attributes in 0 involve only the attributes of one of the expressions (E1) being joined.0E1  E2) = (0(E1))  E2 (b) When 1 involves only the attributes of E1 and2 involves only the attributes of E2. 1 E1 E2) = (1(E1))  ( (E2))

Equivalence Rules (Cont.) 8. The projections operation distributes over the theta join operation as follows. Consider a join E1  E2, and let L1 and L2 be disjoint sets of attributes from E1 and E2, respectively. If  involves only attributes from L1 L2: Now let L3 be attributes of E1 that are involved in join condition , but are not in L1 L2, and let L4 be attributes of E2 that are involved in join condition , but are not in L1 L2. Then:

Equivalence Rules (Cont.) • The set operations union and intersection are commutative E1 E2 = E2 E1 E1 E2 = E2 E1 Note: set difference is not commutative. • Set union and intersection are associative. (E1 E2)  E3 = E1 (E2  E3) (E1 E2)  E3 = E1 (E2  E3)

Equivalence Rules (Cont.) • The selection operation distributes over ,  and –. (E1 – E2) = (E1) – (E2)and similarly for  and  in place of –(E1 – E2) = (E1) – E2 and similarly for in place of –, but not for 12. The projection operation distributes over union L(E1 E2) = (L(E1))  (L(E2))

Transformation Example • Query - Find the names of all customers who have an account at some branch located in Brooklyn: customer-name(branch-city = “Brooklyn” (branch (account depositor))) • Transformation using rule 7a gives: customer-name (branch-city =“Brooklyn” (branch) (account depositor)) • Performing the selection as early as possible reduces the size of the relation to be joined.

Example with Multiple Transformations • Query - Find the names of all customers with an account at a Brooklyn branch whose account balance is over $1000: customer-name (branch-city = “Brooklyn”  balance > 1000(branch (accountdepositor))) • Transformation using join associatively (Rule 6a): customer-name (branch-city = “Brooklyn”  balance > 1000((branchaccount) depositor)) • Applying rules 7a followed by 7b provides an opportunity to apply the “perform selections early” rule, resulting in the sub-expression: customer-name ((branch-city = “Brooklyn”(branch) balance > 1000(account)) depositor))

Multiple Transformations (Cont.)

Projection Operation Example • Consider: customer-name((branch-city = “Brooklyn” (branch) account) depositor) • When we compute: (branch-city = “Brooklyn” (branch) account ) we obtain a relation whose schema is: (branch-name, branch-city, assets, account-number, balance) • Add and push projections using equivalence rules 8a and 8b to eliminate unneeded attributes from intermediate results:customer-name (account-number(branch-city = “Brooklyn” (branch) account ) depositor)

Join Ordering Example • Equivalence rule 6 states that: (r1r2) r3 = r1 (r2r3 ) • Consequently, if r2r3 is quite large relative to the size of r1r2, then the expression is evaluated as: (r1r2) r3 so that the size of the temporary relation is minimized.

Join Ordering Example (Cont.) • Consider the expression: customer-name(branch-city = “Brooklyn” (branch) account depositor) • Could compute account depositor first, and join result with: branch-city = “Brooklyn” (branch) but account depositor is likely to be a large relation. • Since it is more likely that only a small fraction of the bank’s customers have accounts in branches located in Brooklyn, it is better to first compute: branch-city = “Brooklyn” (branch) account

Explicit Enumerationof All Equivalent Expressions • As noted previously, cost-based optimizers use equivalence rules to systematically generate expressions equivalent to a given expression, using an algorithm such as: S={e}; repeat for (each ex1 in S) loop for (each equivalence rule r) loop if (r applies to ex1) then { apply r to ex1 to get ex2; add ex2 to S; } until (no new expressions have been found); • This approach is very expensive in space and time

Explicit Enumeration, Cont. • For example, consider finding the best join-order for r1, r2,. . ., rn. • There are (2(n – 1))!/(n – 1)! different join orders for above expression. • with n = 7, the number is 665280 • with n = 10, thenumber is greater than 176 billion! • Explicitly generating and evaluating each join order would be expensive and also redundant (as noted previously).

Applying Dynamic Programming to Optimization of Complex Joins • When E1 is generated from E2 by an equivalence rule, usually only the top level of the two are different, sub-expressions below are the same: ((r1r2) r3) r4 = (r1r2) (r3 r4) • Time and space requirements can be reduced by generating and developing a plan for each unique sub-expression at most once. • This applies to expressions other than joins also.

Applying Dynamic Programming to Optimization of Complex Joins, Cont. • Using dynamic programming techniques, the least-cost join order for any subset of {r1, r2, . . . rn} can be computed only once and stored for future use. • To find best plan for a set S of n relations, consider all possible plans of the form: S1 (S – S1) where S1 is any non-empty subset of S. • Recursively compute the costs for joining subsets of S to find the cost of each plan. Choose the cheapest of the 2n– 1 alternatives. • Whenever the plan for any subset is computed, store it for later use so that it doesn’t need to be re-computing. • Dynamic programming

Applying Dynamic Programming to Optimization of Complex Joins, Cont. • Considering all possible plans of the form: S1 (S – S1) where S1 is any non-empty subset of S. • Example, suppose S = {r1, r2, r3, r4, r5, r6} S1 S – S1 {r1, r2, r3} {r4, r5, r6} {r3, r4} {r1, r2, r5, r6} {r1, r4, r5, r6} {r2, r3} {r3} {r1, r2, r4, r5, r6} {r6} {r1, r2, r3, r4, r5} {r2, r3, r4, r5} {r1, r6} : :

Dynamic Programming Join Order Optimization Algorithm S : A set of relations to be joined {r1, r2, . . . , rn} bestplan : An array bestplan containing one location for each subset of S. Each location contains two values bestplan[S].costand bestplan[S].plan For each set S containing one relation, bestplan[S].cost = 0, for all others bestplan[S] =  procedure findbestplan(S)if (bestplan[S].cost  ) // bestplan[S] has been computed earlier, so return itreturn bestplan[S];else // bestplan[S] has not been computed earlier, so compute it nowfor (eachnon-empty subset S1 of S such that S1  S) loopP1= findbestplan(S1); P2= findbestplan(S - S1); A = best plan (algorithm and order) for joining results of P1 and P2; cost = P1.cost + P2.cost + cost of A;if (cost < bestplan[S].cost) then bestplan[S].cost = cost;bestplan[S].plan = “execute P1.plan; execute P2.plan; join results of P1 and P2 using A” end if; end loop; returnbestplan[S]; end if; end;

Cost of DynamicProgramming Algorithm • Worst case running time is O(3n). • With n = 10, this number is 59000 instead of 176 billion! • Space used is O(2n) • Running time can be reduced by considering only left-deep join orders: • Consider n alternatives with one relation as right-hand side input and the other n-1 relations as left-hand side input. • Using (recursively computed and stored) least-cost join order for each alternative on left-hand-side, choose the cheapest of the n alternatives.

Left Deep Join Trees • In left-deep join trees, the right-hand-side input for each join is a relation, not the result of an intermediate join. • If only left-deep trees are considered, the worst case running time of the (modified) algorithm is O(n2n) • Space complexity remains at O(2n)

Chapter 14 Query Optimization