DB performance tuning using indexes

DB performance tuning using indexes Section 8.5 and Chapters 20 (Raghu)

What do you need to know? • A relational operator can be executed using different physical algorithms • An SQL query corresponds to a relational algebra expression and may map to alternative equivalent execution plans

Join Operation • Several different algorithms to implement joins • Nested-loop join • Block nested-loop join • Indexed nested-loop join • Merge-join • Hash-join • Choice based on cost estimate

Nested-Loop (NL) Join • To compute the theta join rsfor each tuple tr in r do begin for each tuple tsin s do begintest pair (tr,ts) tosee if they satisfy the join condition  if they do, add tr.tsto the result.endend • r is called the outerrelation and s the inner relation of the join. • Requires no indices and can be used with any kind of join condition. • Expensive since it examines every pair of tuples in the two relations.

Merge-Join • Sort both relations on their join attribute (if not already sorted on the join attributes). • Merge the sorted relations to join them • Join step is similar to the merge stage of the sort-merge algorithm. • Main difference is handling of duplicate values in join attribute: every pair with same value on join attribute must be matched

Equivalent execution plans select d.customer-name from branch b, account a, depositor d where b.branch-name = a.branch-name and a.account-number = d.account-number and b.branch-city = ‘Brooklyn’

Understanding the Workload • For each query in the workload: • Which relations does it access? • Which attributes are retrieved? • Which attributes are involved in selection/join conditions? How selective are these conditions likely to be? • For each update in the workload: • Which attributes are involved in selection/join conditions? How selective are these conditions likely to be? • The type of update (INSERT/DELETE/UPDATE), and the attributes that are affected.

Choice of Indexes • What indexes should we create? • Which relations should have indexes? What field(s) should be the search key? Should we build several indexes? • For each index, what kind of an index should it be? • Clustered? Hash/tree?

Choice of Indexes (Contd.) • One approach: Consider the most important queries in turn. Consider the best plan using the current indexes, and see if a better plan is possible with an additional index. If so, create it. • Obviously, this implies that we must understand how a DBMS evaluates queries and creates query evaluation plans! • For now, we discuss simple 1-table queries. • Before creating an index, must also consider the impact on updates in the workload! • Trade-off: Indexes can make queries go faster, updates slower. Require disk space, too.

Index Selection Guidelines • Attributes in WHEREclause are candidates for index keys. • Exact match condition suggests hash index. • Range query suggests tree index. • Clustering is especially useful for range queries; can also help on equality queries if there are many duplicates. • Multi-attribute search keys should be considered when a WHERE clause contains several conditions. • Order of attributes is important for range queries. • Such indexes can sometimes enable index-only strategies for important queries. • For index-only strategies, clustering is not important! • Try to choose indexes that benefit as many queries as possible. Since only one index can be clustered per relation, choose it based on important queries that would benefit the most from clustering.

Examples of Clustered Indexes • B+ tree index on E.age can be used to get qualifying tuples. • How selective is the condition? • Is the index clustered? • Consider the GROUP BY query. • If many tuples have E.age > 10, using E.age index and sorting the retrieved tuples may be costly. • Clustered E.dno index may be better! • Equality queries and duplicates: • Clustering on E.hobby helps! SELECT E.dno FROM Emp E WHERE E.age>40 SELECT E.dno, COUNT (*) FROM Emp E WHERE E.age>10 GROUP BY E.dno SELECT E.dno FROM Emp E WHERE E.hobby=Stamps

Indexes with Composite Search Keys Examples of composite key indexes using lexicographic order. • Composite Search Keys:Search on a combination of fields. • Equality query: Every field value is equal to a constant value. E.g. wrt <sal,age> index: • age=20 and sal =75 • Range query: Some field value is not a constant. E.g.: • age =20; or age=20 and sal > 10 • Data entries in index sorted by search key to support range queries. • Lexicographic order, or • Spatial order. 11,80 11 12 12,10 name age sal 12,20 12 13,75 bob 12 10 13 <age, sal> cal 11 80 <age> joe 12 20 10,12 sue 13 75 10 20 20,12 Data records sorted by name 75,13 75 80,11 80 <sal, age> <sal> Data entries in index sorted by <sal,age> Data entries sorted by <sal>

Composite Search Keys • To retrieve Emp records with age=30 ANDsal=4000, an index on <age,sal> would be better than an index on age or an index on sal. • Choice of index key orthogonal to clustering etc. • If condition is: 20<age<30 AND 3000<sal<5000: • Clustered tree index on <age,sal> or <sal,age> is best. • If condition is: age=30 AND 3000<sal<5000: • Clustered <age,sal> index much better than <sal,age> index! • Composite indexes are larger, updated more often.

Index-Only Plans • A number of queries can be answered without retrieving any tuples from one or more of the relations involved if a suitable index is available. SELECT E.dno, COUNT(*) FROM Emp E GROUP BY E.dno <E.dno> SELECT E.dno, MIN(E.sal) FROM Emp E GROUP BY E.dno <E.dno,E.sal> Tree index! <E. age,E.sal> or <E.sal, E.age> SELECTAVG(E.sal) FROM Emp E WHERE E.age=25 AND E.sal BETWEEN 3000 AND 5000 Tree!

Index Selection for Joins (20.3) • When considering a join condition: • Hash index on inner relation (where the search key includes the join columns) is very good for Index Nested Loops. • Should be clustered if join column is not key for inner, and inner tuples need to be retrieved. • Clustered B+ tree on join column(s) good for Sort-Merge, because don’t need to sort. • As examples show, our choice of indexes is guided by the plan(s) that we expect an optimizer to consider for a query.

D.dno = E.dno Example 1 SELECT E.ename, D.mgr FROM Emp E, Dept D WHERE D.dname=‘Toy’ AND E.dno=D.dno • Hash index on D.dname supports ‘Toy’ selection. • Given this, index on D.dno is not needed. • Hash index on E.dno allows us to get matching (inner) Emp tuples for each selected (outer) Dept tuple. E.ename, D.mgr (INL) dname = ‘Toy’ Dept Emp (hash index on dno) (hash index on dname)

D.dno = E.dno Example 1a SELECT E.ename, D.mgr FROM Emp E, Dept D WHERE D.dname=‘Toy’ AND E.dno=D.dno AND E.age = 25 • What if WHERE included: `` ... AND E.age=25’’ ? • Could retrieve Emp tuples using index on E.age, then join with Dept tuples satisfying dname selection. Comparable to strategy that used E.dno index. • So, if E.age index is already created, this query provides much less motivation for adding an E.dno index. D.dname = ‘Toy’ E.age = 25 Dept Emp (hash index on E.age) (hash index on D.dname)

SELECT E.ename, D.mgr FROM Emp E, Dept D WHERE E.sal BETWEEN 10000 AND 20000 AND E.hobby=‘Stamps’ AND E.dno=D.dno Example 2 • Clearly, Emp should be the outer relation. • Suggests that we build a hash index on D.dno. • What index should we build on Emp? • B+ tree on E.sal could be used, OR an(hash) index on E.hobbycould be used. Only one of these is needed, and which is better depends upon the selectivity of the conditions. • As a rule of thumb, equality selections more selective than range selections.

D.dno = E.dno Clustering and Joins (20.4) (INL) SELECT E.ename, D.mgr FROM Emp E, Dept D WHERE D.dname=‘Toy’ AND E.dno=D.dno • Clustering is especially important when accessing inner tuples in INL => should make index on E.dno clustered. D.dname = ‘Toy’ (hash index on D.dname) Dept Emp (hash index on E.dno)

dno = dno Clustering and Joins (cont.) SELECT E.ename, D.mgr FROM Emp E, Dept D WHERE E.hobby=‘Stamps AND E.dno=D.dno (Sort-Merge) • If many employees collect stamps, Sort-Merge join may be worth considering. A clustered index on D.dno would help. Summary: Clustering is useful whenever many tuples are to be retrieved. hobby = ‘Stamps’ Emp Dept (hash index on D.dno) (hash index on E.hobby)

Rewriting queries (20.7.3, 20.9) • If a query runs slower than expected, check the plan that is being used. The choice of indexes may have to be adjusted, an index may need to be re-built , statistics may be too old, or a query may have to be rewritten. • Sometimes, the DBMS may not be executing the plan you had in mind. Common areas of weakness: • Selections involving null values, arithmetic or string expressions (ex: WHERE E.age = 2*D.age) • Selections involving OR conditions (next slide) • Lack of evaluation featureslike index-only strategies or certain join methods or poor size estimation.

Rewriting queries with DISTINCT • Minimize the use of DISTINCT: • don’t need it if duplicates are acceptable, • or if answer contains a key. • Goal is to avoid expensive operations, like duplicate elimination

Rewriting selections involving OR • Suppose indexes on hobby and age • Rewrite query using union. SELECT E.dno FROM Employees E WHERE E.hobby = ‘Stamps’ UNION SELECT E.dno FROM Employees E WHERE E.age = 10 SELECT E.dno FROM Employees E WHERE E.hobby = ‘Stamps’ OR E.age = 10 =

Rewriting query with GROUP BY/HAVING • Minimize the use of GROUP BYand HAVING: SELECT MIN (E.age) FROM Employee E GROUP BY E.dno HAVING E.dno=102 SELECT MIN (E.age) FROM Employee E WHERE E.dno=102 =

Rewriting nested queries • Use only one “query block”, if possible. SELECT DISTINCT * FROM Sailors S WHERE S.sname IN (SELECT Y.sname FROM YoungSailors Y) SELECT DISTINCT S.* FROM Sailors S, YoungSailors Y WHERE S.sname = Y.sname =

DB performance tuning using indexes