Estimating the Cost of Operations

Estimating the Cost of Operations • We don’t want to execute the query in order to learn the costs. So, we need to estimate the costs. • How can we estimate the number of tuples in an intermediate relation? Rules about estimation formulas: • Give (somehow) accurate estimates • Easy to compute

Projection • Projection  retains duplicates, so the number of tuples in the result is the same as in the input. • Result tuples are usually shorter than the input tuples. • The size of a projection is the only one we can compute exactly.

Selection Let S = A=c (R) We can estimate T(S) = T(R) / V(R,A) Let S = A<c (R) On average, T(S) would be T(R)/2, but more properly: T(R)/3 Let S = Ac (R), Then, an estimate is: T(S) = T(R)*[(V(R,A)-1)/V(R,A)], or simply T(S) = T(R)

Selection ... Let S = C AND D(R) = C(D(R))and U = D(R). First estimate T(U) and then use this to estimate T(S). Example S = a=10 ANDb<20(R) T(R) = 10,000, V(R,a) = 50 T(S) = (1/50)* (1/3) * T(R) = 67 Note: Watch for selections like: a=10 AND a>20(R)

Selection ... • Let S = C OR D(R). • Simple estimate: T(S) = T(C(R)) + T(D(R)). • Problem: It is possible that T(S)T(R)! A more accurate estimate • Let: • T(R)=n, • m1 = size of selection on C, and • m2 = size of selection on D. • Then T(S) = n(1-(1-m1/n)(1-m2/n))Why? • Example: S = a=10 OR b<20(R). T(R) = 10,000, V(R,a) =50 • Simple estimation: T(S) = 3533 • More accurate: T(S) = 3466

Natural Join R(X,Y) S(Y,Z) • Anything could happen! • No tuples join: T(R S) = 0 • Y is key in S and a foreign key in R (i.e., R.Y refers to S.Y): Then, T(R S) = T(R) • All tuples join: i.e. R.Y=S.Y = a. Then, T(R S) = T(R)*T(S)

Two Assumptions • Containment of value sets • If V(R,Y) ≤V(S,Y), then every Y-value in R is assumed to occur as a Y-value in S • When such thing can happen? • For example when:Y is foreign key in R, and key in S • Preservation of set values • If A is an attribute of R but not S, then it is assumed that • V(R S, A)=V(R, A) • This may be violated when there are dangling tuples in R • There is no violation when: Y is foreign key in R, and key in S

Natural Join size estimation • Let, R(X,Y) and S(Y,Z), where Y is a single attribute. • What’s the size of T(R S)? • Let r be a tuple in R and s be a tuple in S. What’s the probability that r and s join? • Suppose V(R,Y)  V(S,Y) • By the containment of set values we infer that: • Every Y’s value in R appears in S. • So, the tuple r of R surely is going match with some tuples of S, but what’s the probability it matches with s? • It’s 1/V(S,Y). • Hence, T(R S) = T(R)*T(S)/V(S,Y) • When V(R,Y)  V(S,Y) • By a similar reasoning, for the case when V(S,Y)  V(R,Y), we get T(R S) = T(R)*T(S)/V(R,Y). • So, sumarizing we have as an estimate: T(R S) = T(R)*T(S)/max{V(R,Y),V(S,Y)}

T(R S) = T(R)*T(S)/max{V(R,Y),V(S,Y)} • Example: • R(a,b), T(R)=1000, V(R,b)=20 • S(b,c), T(S)=2000, V(S,b)=50, V(S,c)=100 • U(c,d), T(U)=5000, V(U,c)=500 • Estimate the size of R  S U • T(R S) = • 1000*2000 / 50 = 40,000 • T((R S)U) = • 40000 * 5000 / 500 = 400,000 • T(S U) = • 20,000 • T(R (S U)) = • 1000*20000 / 50 = 400,000 • The equality of results is not a coincidence. • Note 1: estimate of final result should not depend on the evaluation order • Note 2:intermediate results could be of different sizes

Natural join with multiple join attrib. • R(x,y1,y2)  S(y1,y2,z) • T(R  S) = T(R)*T(S)/m1*m2, where m1 = max{V(R,y1),V(S,y1)} m2 = max{V(R,y2),V(S,y2)} • Why? • Let r be a tuple in R and s be a tuple in S. What’s the probability that r and s agree on y1? From the previous reasoning, it’s 1/max{V(R,y1),V(S,y1)} • Similarly, what’s the probability that r and s agree on y2? It’s 1/max{V(R,y2),V(S,y2)} • Assuming that aggrements on y1 and y2 are independent we estimate: T(R S) = T(R)*T(S)/[max{V(R,y1),V(S,y1)} * max{V(R,y2),V(S,y2)}] Example: T(R)=1000, V(R,b)=20, V(R,c)=100 T(S)=2000, V(S,d)=50, V(S,e)=50 R(a,b,c) R.b=S.d AND R.c=S.e S(d,e,f) T(R  S) = (1000*2000)/(50*100)=400

Another example: (one of the previous) • R(a,b), T(R)=1000, V(R,b)=20 • S(b,c), T(S)=2000, V(S,b)=50, V(S,c)=100 • U(c,d), T(U)=5000, V(U,c)=500 • Estimate the size of R  S U • Observe that R  S U = (R U)  S • T(R  U) = • 1000*5000 = 5,000,000 • Note that the number of b’s in the product is 20 (V(R,b)), and the number of c’s is 500 (V(U,c)). • T((R U)S) = • 5,000,000 * 2000 / (50 * 500) = 400,000

Size estimates for other operations Cartesian product: T(R S) = T(R) * T(S) Bag Union: sum of sizes Set union: larger + half the smaller. Why? Because a set union can be as large as the sum of sizes or as small as the larger of the two arguments. Something in the middle is suggested. Intersection: half the smaller. Why? Because intersection can be as small as 0 or as large as the sizes of the smaller. Something in the middle is suggested. Difference: T(R-S) = T(R) - 1/2*T(S) Because the result can have between T(R) and T(R)-T(S). Something in the middle is suggested.

Size estimates for other operations Duplicate elimination in (R(a1,...,an)): The size ranges from 1 to T(R). T((R))= V(R,[a1...an]), if available (but usually not available). Otherwise: T((R))= min[V(R,a1)*...*V(R,an), 1/2*T(R)] is suggested. Why? V(R,a1)*...*V(R,an) is the upper limit on the number of distinct tuples that could exist 1/2*T(R) is because the size can be as small as 1 or as big as T(R) Grouping and Aggregation: similar to , but only with respect to grouping attributes.

Computing the statistics • It is the DBA who orders the statistics to be computed by the system. • T(R)’s, and V(R,A)’s are just aggregation queries (COUNT queries). • However, they are expensive to be computed.

Incremental computation of statistics • Maintaining T(R): • Add 1 for every insertion and subtract 1 for every deletion. • What’s the problem? • If there is a B-Tree on any attribute of R, then: • Just keep track of the B-Tree blocks and infer the approximate size of the relation. • Requires effort only when? • On B-Tree changes, which is relative rare compared with the rate of insertions and deletions.

Incremental computation of statistics • Maintaining V(R,A): • If there is an index on attribute A of a relation R, then: • On insert into R, we must find the A-value for the new tuple in the index anyway, and so we can determine whether there is already such a value for A. If not increment V(R,A). • On deletion… • If there isn’t an index on A, the system could in effect create a rudimentary index by keeping a data structure (e.g. B-Tree) that holds every value of A. • Final option: Sampling the relation.

4 7 rest Most frequent values Histograms 1-2 3-4 4-5 6-7 8-9 Equal width Advantage: more accurate estimate of the size of a join.

Example (freq. Values histogram) Estimate U = R(a,b)  S(b,c) V(R,b) = 14. Histogram for R.b: 0:150, 1:200, 5:100, rest: 550 V(S,b) = 13. Histogram for S.b: 0:100, 1:80, 2:70, rest: 250 • Tuples in U • on 0: • 100*150 = 15,000 • on 1: • 200*80 = 16,000 • on 2: • 70 * (550/(14-3)) = 3500 • on 5: • 100 * (250/(13-3)) = 2500 • on the 9 other values: • 9*(550/11)*(250/10) = 9*1250 We have 9 values, because V(S,b)<V(R,b), and by the preservation of value sets assumption, all the 9 values we didn’t consider yet in S, will be in R as well. Total T(U) = 15000 + 16000 + 3500 + 2500 + 9*1250 = 48,250 Simple estimate (equal occurrence assumption) T(U) = 1000*500/14 = 35,714

Example (equal width histogram) • Schemas: Jan(day,temp) July(day,temp) • Query: SELECT Jan.day, July.day FROM Jan, July WHERE Jan.temp=July.temp; Size of join of each band T1*T2/Width On band 40-49: 10*5/10 = 5 On band 50-59: 5*20/10 = 10  size of the result is thus 5+10 = 15 Without using the histogram we would estimate the size as 245*245/100 = 600 !!

Estimating the Cost of Operations