Query Optimization and Indexes

Query Optimization and Indexes

Query Optimization and Indexes Introduction Relational Databases DB2/400

QUERY OPTIMIATION AND INDEXES Introduction Overview Research Problem Literature

INTRODUCTION Overview • IBM has a DBMS called DB2. • Query optimization • Can significantly improve database performance • If the right indexes are available, queries can generally be implemented using better performing algorithms.

INTRODUCTION Research Problem • Due to the complexity of the query optimizer, customers are often baffled about indexes and query response time. Users need advise on what indexes would offer the best performance. • Management asks: How can an intelligent index manager be created? • Related question: How can queries be evaluated to determine a set of indexes which may minimize the total cost of database transactions, thus optimizing performance?

INTRODUCTION Literature • Index selection problem (ISP) for secondary indexes is a well-known optimization problem. • This problem known to be NP-Complete.

QUERY OPTIMIATION AND INDEXES Relational Databases Indexes Background Query Optimization Table Access and Joins

RELATIONAL DBs Background • Relational model introduced by Codd • Based on relations (tables), tuples (records), and attributes (fields) • Queries can be represented using relational algebra, relational calculus, or as a graph. • SQL is a high level query language.

RELATIONAL DBs Query Optimization • A query optimizer takes the query and creates a procedural sequence of implementation steps known as an access plan. • This plan is created after analyzing various alternatives to arrive at the best choice. • Table indexes are critical in creating efficient access plans. • Objectives of query optimization: • Minimize response time • Minimize usage of system resources

RELATIONAL DBs Query Optimization Objectives • Hardware objectives include minimizing: • CPU costs • Communications costs (in a network) • I/O costs for accessing secondary storage • Cost of using main memory and secondary storage • Software objectives include minimizing: • Cost of query optimization (should be small compared to execution cost)

RELATIONAL DBs Query Optimization Process Rewriter Algebraic Space Cost Model Planner Method Structure Space Size-Distribution Estimator

RELATIONAL DBs Query Optimization Process • Rewriter • creates an internal query representation. • applies transformations to streamline query evaluation.

RELATIONAL DBs Query Optimization Process • Planner • maps the transformed query • into various sequences of operations (algebraic space) • which can be implemented (method structure space) • with a known cost (cost model & size distribution), creating candidate access plans. • computes the cost of each candidate plan, and chooses the cheapest one.

RELATIONAL DBs Indexes B-Tree Indexes Clustered/Non-clustered Indexes Hash Indexes

RELATIONAL DBs Indexes Introduction • Memory access on the order of nanoseconds • For example, 40 (.00000004 sec.) • Disk access on the order of milliseconds • For example, 25 (.025 sec.) i.e., 40 random I/O’s per sec. • .016 sec. Seek time (Move disk arm to the proper cylinder) • .008 sec. Rotational latency (Rotate platter into position) • .001 sec. Transfer time (to read/write data) • .025 sec. Total • $ cost of disk cheap compared to RAM

RELATIONAL DBs Indexes B-Tree Indexes • Composition • Root level node (one at the first level) • Directory (or index) nodes (usually one or two levels) • Leaf level nodes (the bottom level, pointing to records) ... 221 np 346 np 398 ... … 278 rid 305 rid 346 rid … … 377 rid 411 rid 449 rid … np = node pointer to disk page rid = relative row ID to actual record

RELATIONAL DBs Indexes B-Tree Indexes • Significantly, due to the frequency of access, upper-levels of a B-tree index may remain in memory. • For example, to access a record in a million record database: assuming fanout of 256 • B-tree index has 3 levels CEIL(log256 1,000,000) • B-tree index probed 3 times reading 3 pages into RAM • Only 1 disk I/O may be required to read in the disk page holding the leaf node pointing to the desired record.

RELATIONAL DBs Indexes B-Tree Indexes • B-tree index on a one million record database has 34 nodes above the leaf nodes • Assume: • key values 4 bytes long (an integer) • node pointers & row IDs 4 bytes long • fill factor 70% • disk page header 48 bytes • disk page size 2 KB • so (2048-48) *.70/(4+4) 175 entries per disk page • CEIL(1,000,000/175) 5715 disk pages for leaf nodes • CEIL(5715/175) 33 disk pages for directory nodes • CEIL(33/175) 1 disk page for parent node(s), root

RELATIONAL DBs Indexes Clustered/Non-clustered Indexes • In a clustered index the records are stored in the same order as the key. • Provides superior performance, for example: • Database of 10 million customer in 200 cities • Customer records of 100 bytes • so 20 data records per disk page (2048-48)/100 • so 500,000 disk pages for database (10,000,000/20) • Mailing for a specific city averages 50,000 customers • If clustered index on city, only 2500 disk pages of data (50,000/20) need to be scanned, taking 1 min. (2500/40*60) • If non-clustered, we could assume scan 50,000 disk pages of data, taking 20 min. (50,000/40*60)

RELATIONAL DBs Indexes Hash Indexes • No disk file of keys, only a single I/O required (ideally) to read a record from disk. • The record key is the input value to a some hash function, whose output becomes the key to a disk page or relative record position. • Collisions, well-known problem, where 2 records have the same hash value. • Ideally, rehash to same disk page to minimize I/O. • Generally, cannot increase size of hash table. • Extensible hashing allows hash table to grow, based on a linear hashing algorithm.

RELATIONAL DBs Table Access and Joins No Indexes Nested Loop Joins Simple Indexes Sort-Merge Joins Composite Indexes Multiple Indexes

RELATIONAL DBs Table Access and Joins Introduction • When a database system receives a query it compiles the query. • This includes syntax checking and query optimization. • The result of the compilation step is an access plan (a series of steps specific to the computer and database) which will execute the query at run time. • Query optimizer • Minimizes CPU time & number of I/O requests. • CPU memory, although important, may be limited to certain established levels.

RELATIONAL DBs Table Access and Joins No Indexes • A table scan is an access step where all the rows in a table are sequentially searched. • The data collected is restricted by the WHERE clause of an SQL statement. • Access plan (DB2/MVS): • ACCESSTYPE column • Letter R, for a table scan • PREFETCH column • Blank, if random I/O • Letter S, if multi-block I/O (called sequential prefetch)

RELATIONAL DBs Table Access and Joins Simple/Composite Indexes • A matching index scan: • implemented in DB2/MVS using a B-tree index. • based on a single index • rows retrieved from a table based on the condition(s) specified in the WHERE clause of the SELECT statement.

RELATIONAL DBs Table Access and Joins Simple Indexes • The access plan for a SELECT statement may use indexes to limit the number of rows searched in a database. For example: If ZipIdx exists on ZipCode in the Customers table, SELECT * FROM Customers WHERE ZipCode = 56001 implemented in DB2/MVS using a matching index scan. • The plan for this query takes one step and is designated by the following DB2/MVS access plan: ACCESSTYPE = I for an index scan ACCESSNAME = ZipIdx MATCHCOLS = 1

RELATIONAL DBs Table Access and Joins Composite Indexes • A matching index scan retrieves records from a table when the column components of the index can be matched to the predicates of the WHERE clause. For example: If MailIdx exists on ZipCode + IncomeLevel + MaritalStatus SELECT Name, Address FROM Customers WHERE ZipCode = 56001 AND IncomeLevel = 10 implemented in DB2/MVS using a matching index scan. • The DB2 access plan will have: ACCESSTYPE = I for an index scan ACCESSNAME = MailIdx MATCHCOLS = 2

RELATIONAL DBs Table Access and Joins Multiple Indexes • Multiple index access used when different indexes on the predicates in the WHERE clause. For example: SELECT Name, Address FROM Customers WHERE ZipCode = 56001 AND (IncomeLevel = 10 OR MaritalStatus = “M”) DB2 Access Plan assuming ZipIdx, IncomeIdx, and MaritalIdx : TNAME MATCHCOLS PREFETCH ACCESSTYPE ACCESSNAME MIXOPSEQ Customers M 0 L 0 Customers MX 1 IncomeIdx S 1 Customers MX 1 MaritalIdx S 2 Customers MU 0 3 Customers MX 1 ZipIdx S 4 Customers MI 0 5

RELATIONAL DBs Table Access and Joins Overview • Common methods to join tables • Nested Loop • Sort Merge • Hash • In DB2, multi-step access plans for join processing make use of columns called: • PLANNO(1, 2, …) to indicate which step • METHOD(1=Nested Loop, 2=Sort Merge, 4=Hybrid Join) to indicate the join technique chosen by the query optimizer • TABNO(such as 1 or 2) to indicate whether we are extracting rows from the first or second table of the join • SORTN_JOIN(Y or N) to indicate whether a sort is required (for example, for the sort merge join)

RELATIONAL DBs Table Access and Joins Overview • A join of two tables occurs in two steps • where one table becomes the outer table • and the other becomes the inner table. • During join processing • records from outer table are presented one-by-one to the inner table • and the inner table searched for records matching the one presented to it.

RELATIONAL DBs Table Access and Joins Nested Loop Joins • Process: • Records of the outer table are retrieved (or presented) using a table scan (or indexing, if possible). • Only candidate records are retrieved. • Candidate records satisfy local predicates. • For each retrieved row from the outer table: • the inner table is searched for qualifying records • & the results merged into a new record in an output table. SELECT CustomerID, Name, Street, City, State, Zip, TotalSales FROM Customers, Sales WHERE State = “MN” AND Customers.CustomerID = Sales.CustomerID

RELATIONAL DBs Table Access and Joins Nested Loop Joins • Appropriate: • When the outer table has only a few records (after applying predicates) • Where the inner table is small or has an index usable to access qualifying records • Drawbacks when the inner and outer tables are not indexed or clustered on the same values: • The inner table (including indexes) may be scanned repetitively to find matching records. • The outer table is processed inefficiently for records with the same value in the join columns.

RELATIONAL DBs Table Access and Joins Sort-Merge Joins • Algorithm where the 2 tables are scanned only once: • For each table, a temporary table is created of qualifying candidate records. • Each temporary table is sorted on the same columns based on the joining predicates (conditions). • Finally the two temporary tables are merged and joined (as an inner and outer table) into a third table. • If either of the original two tables has an index on the selection and joining predicates, it may be possible to skip the creation of the corresponding temporary table.

QUERY OPTIMIATION AND INDEXES DB/400 Components Research Problem/Objectives Data Management Methods Proposed Solution The Optimizer Proposed Scope The Database Monitor Summary

DB/400 Components • Query component -> Query optimizer • Cost based • Data management methods • Access paths • Access methods

DB/400 Data Management Methods • Access Paths • Sequential, also called arrival sequence • accesses data in physical order • Keyed sequential, uses indexes • accesses data in the order of the index • Access Methods • Dataspace scan - similar to table scan - parallel version too • Key selection - requires index - parallel version too • Key positioning - requires index - parallel version too • Index only - requires index - SMP required • Index-from-Index - requires index - SMP required • Hashing - SMP required

DB/400 The Optimizer • Query optimization is a tradeoff between • the time to determine an optimal implementation • the time to actually execute the query • The query optimizer • selects the most efficient access method at query run-time • identifies alternatives • estimates current costs • optimizes joins and grouping operations

DB/400 The Optimizer • The access cost is modeled for: • reading records without an index (a dataspace scan) • reading records with an index (key selection or key positioning)* • creating a temporary index on the relevant data • creating a temporary index on another index (index-from-index) • using the hashing method or a query sort routine *Each index is evaluated, unless a time limit is reached first. Indexes are examined in LIFO order.

DB/400 The Optimizer • The cost of a particular method is the sum of: • the start-up cost • the cost associated with the optimization mode** • the cost of creating any indexes • the cost for the expected number of page faults to read the data and the cost to process the expected number of rows **The optimization mode is given by a parameter, which defines the minimization goal as either the time to retrieve the first buffer of data or the time to retrieve all the rows that would be selected.

DB/400 The Database Monitor • The database monitor collects statistics and data about query implementation and performance during query optimization and execution. • For example, SQL can extract data showing: • Queries implemented as table scans • Queries taking the most time • Queries stopped by the query governor • All queries executed and data on each query, such as • Table names • Number of rows in a table and the number of rows selected • Estimated & actual execution time • Indexed advised & fields

DB/400 Research Problem/Objectives RESEARCH PROBLEM Improve query performance by determining an appropriate set of indexes. RESEARCH OBJECTIVES • Determine the information required & available. • Determine the formulas/algorithms needed to evaluate the data. • Determine the user interface (including capabilities) of an user-friendly intelligent index advisor.

DB/400 Proposed Scope • Create a prototype intelligent index manager. • Experiment with some basic queries. • Evaluate simple queries • no joins & no GROUP BY or ORDER BY clauses in SQL.

DB/400 Summary • IBM’s DB2/400 has several components: • The query component, including the query optimizer which: • chooses an minimal cost method to implement queries • validating or creating an access plan • a control structure containing the implementation data. • The data management methods: • algorithms for retrieving data through an access path using a table access method. • The database monitor: • collects statistics and implementation details about queries. • includes recommendations on creating indexes: • on a per query basis • based on a limited subset of the query optimizer rules.

DB/400 Summary • An intelligent index manager would: • need data on query implementation and execution from the database monitor. • The database monitor would need to be enhanced to save additional data from the query optimizer • Supplemental data may be needed such as • How often the records change in queried tables • User specifications • need algorithms, formulas & rules based on • The major ones used by the query optimizer. • Additional criteria, such as projected index maintenance costs & rules based on the user specifications

DB/400 Summary • An intelligent index manager would: • evaluate & judge if new indexes are needed; and if so, what indexes would provide the best overall system performance. • Decisions would be made on the makeup of the indexes: which tables and which fields. • Decisions would be based on all the queries within its scope • advise users recommending specific indexes • Further research is needed to resolve questionsconcerning exact implementation details: • specific data requirements, algorithms and formulas • an appropriate user-interface

Query Optimization and Indexes Thank you Mark Schoennauer BACKGROUND RESEARCH

Query Optimization and Indexes

Query Optimization and Indexes

Presentation Transcript

Query Optimization

Query Optimization

Query Optimization

QUERY OPTIMIZATION AND QUERY PROCESSING

Query Optimization

Query Optimization

Query Optimization

Query Optimization

Query Optimization

Query Optimization

Query Optimization

Query Processing and Query Optimization

Query Optimization

Query Optimization

Query Optimization

Indexes and Query Execution Plan

Query Optimization

Query optimization

Query Optimization

Query Optimization

Query Optimization

Query Optimization