Database Administration

Database Administration • Query Processing

Query Processing SELECT A.Name, B.GradeFROM A, BWHERE A.Id = B.Id  Name, Grade (Id=Id(A  B)) How to evaluate this query efficiently? What algorithms and access path to use?

Query Processing in Oracle You can view the query execution plan used in Oracle and other DBMSs

Explain in MySQL Simple means no union or subqueries All means do a full scan of the table Expected # rows to examine Filter rows using where clausesorting is needed for order by

Matching party_index to const to fetch the rows Explain in MySQL Expected # rows to examine decreases

Explain in MySQL Matches at most 1 row Use primary key to retrieve the row

Explain in MySQL SELECT in Outer Query Inner subquery Use primary index to fetch row with id = 300

Sorting • Sorting is used to implement many relational operations (e.g., join, project, intersect, …) • Ordering is specifically requested by users: SELECT … ORDER BY … • Eliminate duplicate tuples: SELECT DISTINCT … • To do sort-merge join (which is used for JOIN, UNION, and INTERSECTION operations) • Problems • Relations are typically large, do not fit in main memory • So we cannot use traditional in-memory sorting algorithms (such as quicksort)

External Sorting • Combines in-memory sorting with techniques for minimizing I/O • Cost of sorting is often measured in terms of number of block transfers • cost of in-memory sorting << I/O cost of block transfers • 2-phase in external sorting: • Partial sorting phase • Merging phase

2 6 5 3 External Sorting • nb = number of input buffer space in memory • b = number of disk blocks for the file to be sorted Example: nb = 2, b = 7 2 input buffers in main memory 5 3 2 6 1 10 15 7 20 11 8 4 7 5 Unsorted file on external storage (disk) run

Partial Sorting • Partial Sorting Phase: • Fetch a segment of the unsorted file from disk into buffers • Sort the data in buffers (e.g., using quicksort) • Write the sorted file segment back to disk • Example: nb=2, b=7 • Number of runs = 7/2 = 4 Unsorted data file 5 3 2 6 1 10 15 7 20 11 8 4 7 5 Partially sorted file 2 3 5 6 1 7 10 15 4 8 11 20 5 7 Run 2 Run 3 Run 4 Run 1

Merging • Merging Phase: • Merge all runs using k input buffers and 1 output buffer • If number of runs > k (k: degree of merging) • Divide runs into groups of size k and merge each group into a run • Repeat until all runs are merged into 1 group k-way merging

Merging Example Partially sorted Input Output 2-way merging 2 3 5 6 2 5 3 6 10 6 3 1 2 5 7 15 1 2 3 5 6 7 10 15 1 7 10 15 10 1 7 15 2 runs (of size 2) are merged into 1 run (of size 4) Input buffers Output buffer

Example of External Sorting • nb = 4, b = 10, k=3 (3-way merging): After partial sorting After 3-way merging

Example of External Sorting • nb = 4, b = 10, k=2 (2-way merging): After partial sorting After 2-way merging After 2-way merging

Cost of External Sorting • Suppose we have b = 1024 blocks and nB = 5 input buffers • Partial sorting: • number of runs, nR = 1024/5 = 205 • Merging: assume degree of merging, k = 5 • After pass 1: number of runs remaining = 205/5 = 41 • After pass 2: number of runs remaining = 41/5 = 9 • After pass 3: number of runs remaining = 9/5 = 2 • After pass 4: number of runs remaining = 2/5 = 1 • Number of merging passes needed = logk nR = logkb/nB

Cost of External Sorting • Partial sorting : • Cost = 2b (each disk block is read once and written back once) • Cost of merge phase (k is degree of merging) • b/nB runs to be merged • Number of passes needed = log k (b/nB) • Each pass requires 2b page transfers • Cost of merging = 2b log k(b/nB) = 2blog k (nR) • Overall cost of external sorting = cost of partial sorting + cost of merging = 2b + 2b log k (nR)

Query Processing SELECT AttribList FROM relations R1, …, Rk WHERE condition Translated to: AttribList (condition (R1  R2  …  Rk)) Example: SELECT Dno FROM Employee WHERE SSN = ‘1234567890’ Dno (SSN=‘1234567890’ (Employee))

Algorithms for Query Processing • Relational Algebra operators • SELECT • JOIN • PROJECT • SET (UNION, INTERSECT, SET DIFFERENCE) • AGGREGATE

Algorithms for SELECT Operations EMPLOYEE(SSN, Fname, Minit, Lname, Sex, Address, Salary, Dno) DEPARTMENT(Dno, Dname, Mgrssn, MgrStartDate) WORKS_ON(ESSN, Pno, Hours) Examples of SELECT operations: Simple selection (OP1): sSSN='123456789' (EMPLOYEE) (OP2): sDNUMBER>5(DEPARTMENT) (OP3): sDNO=5(EMPLOYEE) Complex selection: (OP4): sDNO=5 AND SALARY>30000 AND SEX=F(EMPLOYEE) (OP5): sESSN=123456789 AND PNO=10(WORKS_ON)

Algorithms for SELECT Operation Algorithms for Simple Selection: Linear search (brute force, full tablescan): Retrieve every record in the file, and test whether its attribute values satisfy the selection condition Binary search If the selection condition involves an equality comparison on a key attribute on which the file is ordered, binary search can be used Use primary/clustering/secondary index Use the index to find the record(s) satisfying the corresponding selection condition

Algorithms for SELECT Operation Algorithms for Complex Selection: DNO=5 AND SALARY>30000 AND SEX=F(EMPLOYEE) Linear search (brute-force; full tablescan) Use an individual index E.g.: use the Dno index to retrieve the records and then check whether each retrieved record satisfies the remaining conditions Use a composite index E.g.: use the (Dno, Sex) composite index to retrieve the records and check whether they satisfy the remaining condition Intersection of record pointers: This method is possible if secondary indexes are available on all (or some of) the fields involved E.g.: intersect the record pointers returned by the indexes for Dno, Salary, and Sex

Algorithms for JOIN Operations R A=B S • The cost of joining two relations makes the choice of a join algorithm crucial • Examples • EMPLOYEE DNO=DNUMBER DEPARTMENT • DEPARTMENT MGRSSN=SSN EMPLOYEE

Computing Joins R A=B S • Suppose bR and bS are the number of blocks in R and S, rR and rS are the number of tuples in r and s rS rowsbS blocks S rR rows bR blocks R

Algorithms for JOIN Operations J1 Nested-loop join (brute force) R R.A=S.B S foreach tuple t  R do foreach tuple t’  S do if t.A = t’.B then output (t, t’) • Cost = bR + rR  bS + cost of writing the final result • Very expensive • Order of the loop matters

Algorithms for JOIN Operations J1(b) Nested-block join Instead of joining 1 tuple at a time, join one block at a time Cost = bR + bR/(nB – 2)  bS + cost of writing the result nB is the number of buffers available Number of blocks (bR) << Number of records (rR) R R1 R2 Memory buffer R1 R2 OutputBuffer Join R1 & R2 with blocks in S S1 S2 S3 S4 S S1 S2 S3 S4

Algorithms for JOIN Operations • J2 Single-loop join (Using an access structure) R A=B S foreach tuple t in Rdo { use index to find all tuples t’ in S satisfying t.A = t’.B; output (t.t’) } • Cost = bR + rR  cost of search + cost of writing output

Algorithms for JOIN Operations • J3 Sort-merge join R R,A=S.B S sortR on attribute A; sortS on attribute B; while !eof(R) and !eof(S) do { ScanR and S concurrently until t.A = t’.B = c; OutputA=c(R)  B=c (S) } A=c(R) R  S B=c (S)

Sort-Merge Join R D A 1 3 p p 0 9 q q 8 7 3 s s s 5 7 u u 1 1 v v 1 3 1 3 p p p p p p p p 4 0 0 4 8 7 3 s s s s s s 7 7 7 5 7 5 7 5 7 u u u u u u u u u u u u 2 2 5 5 0 0 B E p p 4 0 r 9 s 7 t t 2 5 u u u 2 5 0 x 0 RA=BS S

Algorithms for JOIN Operations • J4 Hash-join: • Use the same hashing function on the join attributes A of R and B of S as hash keys • Hash the file with fewer records (say, R) to the hash file buckets. • Hash the other file (S) to the appropriate bucket, where the record is combined with all matching records from R.

Example • EMPLOYEE DNO=DNUMBER DEPARTMENT • Meta-data: • DEPARTMENT: • Number of records, rD = 50 • Number of disk blocks to store records, bD = 10 • EMPLOYEE • Number of records, rE = 6000 • Number of disk blocks to store records, bE = 2000 • Number of buffers available in memory, nB = 7 • Size of each buffer is the same as size of each block on disk

Example: Nested Loop Join foreach tuple t  R do foreach tuple t’  S do if t.A = t’.B then output (t, t’) • Simple nested loop join • Cost = bR + rRbs + cost of output • Nested-block join • Suppose there are nB buffers • Use 1 buffer for output, 1 buffer for relation S, nB – 2 for relation R • Cost = bR + bR/(nB – 2) bs + cost of output

Example: Block Nested Loop Join • EMPLOYEE DNO=DNUMBER DEPARTMENT • If EMPLOYEE is outer loop • Cost = 2000 + 2000/5 10 + cost of output = 6000 • If DEPARTMENT is outer loop • Cost = 10 + 10/5 2000 + cost of output = 4010 • More efficient to use DEPARTMENT as the outer loop • Order of the tables in the nested loop matters!

MySQL Example

MySQL Example department table chosen for outer loop of nested block join employee table chosen for inner loop of nested block join

MySQL Example straight_join forces MySQL to process the join in the order given (employee for outer loop and department for inner loop)

Example: Single Loop Join • DEPARTMENT MGRSSN=SSN EMPLOYEE • Suppose there are multi-level indexes on • SSN (for EMPLOYEE) : number of index levels, XSSN = 4 • MGR_SSN (for DEPARTMENT): number of index levels, XMGR_SSN = 2 foreach tuple t in Rdo use index to find all tuples t’ in S satisfying t.A = t’.B; • Use EMPLOYEE as outer loop • Max Cost = bE + rE (xMGR_SSN + 1) + cost of output = 2000 + 6000  3 + cost of output = 20000 + cost of output • Use DEPARTMENT as outer loop • Max Cost = bD + rD  (xSSN + 1) + cost of output = 10 + 50  5 + cost of output = 260 + cost of output

MySQL Example Scan all the rows in department (outer loop of single loop join) For each mgrId value, use the primary key index in employee table to fetch its corresponding row

Explaining EXPLAIN in MySQL • type – determines how table is accessed (most frequent) • “ALL” - full table scan • “eq_ref” - reference by primary or unique key (1 row) • “ref” - reference by non-unique key (multiple rows) • possible_keys - indexes MySQL could use for this table • key – index MySQL sellected to use • “ref” - The column or constant this key is matched against • “rows” - How many rows will be looked up in this table • “extra” - Extra Information • “Using Filesort” - external sort is used • “Using where” - where clause will be resolved

Algorithms for PROJECT Operations • Algorithm for PROJECT operations <attribute list>(R) • If <attribute list> has a key of relation R, extract all tuples from R with only the values for the attributes in <attribute list>. • If <attribute list> does NOT include a key of relation R, duplicated tuples are removed from the results. • Methods to remove duplicate tuples • Sorting • Hashing

Algorithms for SET Operations • Set operations: • UNION, INTERSECTION, SET DIFFERENCE and CARTESIAN PRODUCT • CARTESIAN PRODUCT of relations R and S • Includes all possible combinations of records from R and S. • The attributes of the result include all attributes of R and S. • CARTESIAN PRODUCT operation is very expensive and should be avoided if possible. • If R has n records and j attributes and S has m records and k attributes, the result relation will have n*m records and j+k attributes.

Algorithms for SET Operations • UNION (See Figure 15.3c) • Sort the two relations on the same attributes. • Scan and merge both sorted files concurrently, whenever the same tuple exists in both relations, only one is kept in the merged results. • INTERSECTION (See Figure 15.3d) • Sort the two relations on the same attributes. • Scan and merge both sorted files concurrently, keep in the merged results only those tuples that appear in both relations. • SET DIFFERENCE R-S (See Figure 15.3e) • Sort the two relations on the same attributes. • Scan and merge both sorted files concurrently, keep in the merged results only those tuples that appear in relation R but not in relation S.

Implementing Aggregate Operations • Aggregate operators: • MIN, MAX, SUM, COUNT and AVG • Options to implement aggregate operators: • Table Scan • Use Index • Example: SELECT MAX (SALARY) FROM EMPLOYEE; • If an (ascending) index on SALARY exists, then the optimizer could traverse the index for the largest value, which would entail following the right most pointer in each index node from the root to a leaf.

Database Administration