query optimization and indexes n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Query Optimization and Indexes PowerPoint Presentation
Download Presentation
Query Optimization and Indexes

Loading in 2 Seconds...

play fullscreen
1 / 47

Query Optimization and Indexes - PowerPoint PPT Presentation


  • 109 Views
  • Uploaded on

Query Optimization and Indexes. Query Optimization and Indexes. Introduction. Relational Databases. DB2/400. QUERY OPTIMIATION AND INDEXES. Introduction. Overview. Research Problem. Literature. INTRODUCTION. Overview. IBM has a DBMS called DB2. Query optimization

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

Query Optimization and Indexes


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
query optimization and indexes1

Query Optimization and Indexes

Introduction

Relational Databases

DB2/400

introduction

QUERY OPTIMIATION AND INDEXES

Introduction

Overview

Research Problem

Literature

overview

INTRODUCTION

Overview
  • IBM has a DBMS called DB2.
  • Query optimization
    • Can significantly improve database performance
    • If the right indexes are available, queries can generally be implemented using better performing algorithms.
research problem

INTRODUCTION

Research Problem
  • Due to the complexity of the query optimizer, customers are often baffled about indexes and query response time. Users need advise on what indexes would offer the best performance.
  • Management asks:

How can an intelligent index manager be created?

  • Related question:

How can queries be evaluated to determine a set of indexes which may minimize the total cost of database transactions, thus optimizing performance?

literature

INTRODUCTION

Literature
  • Index selection problem (ISP) for secondary indexes is a well-known optimization problem.
  • This problem known to be NP-Complete.
query optimization and indexes2

Query Optimization and Indexes

Introduction

Relational Databases

DB2/400

relational databases

QUERY OPTIMIATION AND INDEXES

Relational Databases

Indexes

Background

Query Optimization

Table Access and Joins

background

RELATIONAL DBs

Background
  • Relational model introduced by Codd
  • Based on relations (tables), tuples (records), and attributes (fields)
  • Queries can be represented using relational algebra, relational calculus, or as a graph.
  • SQL is a high level query language.
query optimization

RELATIONAL DBs

Query Optimization
  • A query optimizer takes the query and creates a procedural sequence of implementation steps known as an access plan.
    • This plan is created after analyzing various alternatives to arrive at the best choice.
    • Table indexes are critical in creating efficient access plans.
  • Objectives of query optimization:
    • Minimize response time
    • Minimize usage of system resources
query optimization objectives

RELATIONAL DBs

Query Optimization Objectives
  • Hardware objectives include minimizing:
    • CPU costs
    • Communications costs (in a network)
    • I/O costs for accessing secondary storage
    • Cost of using main memory and secondary storage
  • Software objectives include minimizing:
    • Cost of query optimization (should be small compared to execution cost)
query optimization process

RELATIONAL DBs

Query Optimization Process

Rewriter

Algebraic Space

Cost Model

Planner

Method Structure

Space

Size-Distribution

Estimator

query optimization process1

RELATIONAL DBs

Query Optimization Process
  • Rewriter
    • creates an internal query representation.
    • applies transformations

to streamline query evaluation.

query optimization process2

RELATIONAL DBs

Query Optimization Process
  • Planner
    • maps the transformed query
      • into various sequences of operations

(algebraic space)

      • which can be implemented

(method structure space)

      • with a known cost

(cost model & size distribution),

creating candidate access plans.

    • computes the cost of each candidate plan,

and chooses the cheapest one.

indexes

RELATIONAL DBs

Indexes

B-Tree Indexes

Clustered/Non-clustered Indexes

Hash Indexes

introduction1

RELATIONAL DBs

Indexes

Introduction
  • Memory access on the order of nanoseconds
    • For example, 40 (.00000004 sec.)
  • Disk access on the order of milliseconds
    • For example, 25 (.025 sec.) i.e., 40 random I/O’s per sec.
      • .016 sec. Seek time (Move disk arm to the proper cylinder)
      • .008 sec. Rotational latency (Rotate platter into position)
      • .001 sec. Transfer time (to read/write data)
      • .025 sec. Total
  • $ cost of disk cheap compared to RAM
b tree indexes

RELATIONAL DBs

Indexes

B-Tree Indexes
  • Composition
    • Root level node (one at the first level)
    • Directory (or index) nodes (usually one or two levels)
    • Leaf level nodes (the bottom level, pointing to records)

... 221 np 346 np 398 ...

… 278 rid 305 rid 346 rid …

… 377 rid 411 rid 449 rid …

np = node pointer to disk page

rid = relative row ID to actual record

b tree indexes1

RELATIONAL DBs

Indexes

B-Tree Indexes
  • Significantly, due to the frequency of access, upper-levels of a B-tree index may remain in memory.
  • For example,

to access a record in a million record database:

assuming fanout of 256

    • B-tree index has 3 levels CEIL(log256 1,000,000)
    • B-tree index probed 3 times reading 3 pages into RAM
    • Only 1 disk I/O may be required to read in the disk page holding the leaf node pointing to the desired record.
b tree indexes2

RELATIONAL DBs

Indexes

B-Tree Indexes
  • B-tree index on a one million record database

has 34 nodes above the leaf nodes

    • Assume:
      • key values 4 bytes long (an integer)
      • node pointers & row IDs 4 bytes long
      • fill factor 70%
      • disk page header 48 bytes
      • disk page size 2 KB
    • so (2048-48) *.70/(4+4) 175 entries per disk page
      • CEIL(1,000,000/175) 5715 disk pages for leaf nodes
      • CEIL(5715/175) 33 disk pages for directory nodes
      • CEIL(33/175) 1 disk page for parent node(s), root
clustered non clustered indexes

RELATIONAL DBs

Indexes

Clustered/Non-clustered Indexes
  • In a clustered index the records are stored in the same order as the key.
  • Provides superior performance, for example:
    • Database of 10 million customer in 200 cities
    • Customer records of 100 bytes
      • so 20 data records per disk page (2048-48)/100
      • so 500,000 disk pages for database (10,000,000/20)
    • Mailing for a specific city averages 50,000 customers
    • If clustered index on city, only 2500 disk pages of data (50,000/20) need to be scanned, taking 1 min. (2500/40*60)
    • If non-clustered, we could assume scan 50,000 disk pages of data, taking 20 min. (50,000/40*60)
hash indexes

RELATIONAL DBs

Indexes

Hash Indexes
  • No disk file of keys, only a single I/O required (ideally) to read a record from disk.
  • The record key is the input value to a some hash function, whose output becomes the key to a disk page or relative record position.
  • Collisions, well-known problem, where 2 records have the same hash value.
    • Ideally, rehash to same disk page to minimize I/O.
  • Generally, cannot increase size of hash table.
    • Extensible hashing allows hash table to grow, based on a linear hashing algorithm.
table access and joins

RELATIONAL DBs

Table Access and Joins

No Indexes

Nested Loop Joins

Simple Indexes

Sort-Merge Joins

Composite Indexes

Multiple Indexes

introduction2

RELATIONAL DBs

Table Access and Joins

Introduction
  • When a database system receives a query

it compiles the query.

    • This includes syntax checking and query optimization.
    • The result of the compilation step is an access plan (a series of steps specific to the computer and database) which will execute the query at run time.
  • Query optimizer
    • Minimizes CPU time & number of I/O requests.
    • CPU memory, although important, may be limited to certain established levels.
no indexes

RELATIONAL DBs

Table Access and Joins

No Indexes
  • A table scan is an access step where all the rows in a table are sequentially searched.
    • The data collected is restricted by the WHERE clause of an SQL statement.
  • Access plan (DB2/MVS):
    • ACCESSTYPE column
      • Letter R, for a table scan
    • PREFETCH column
      • Blank, if random I/O
      • Letter S, if multi-block I/O (called sequential prefetch)
simple composite indexes

RELATIONAL DBs

Table Access and Joins

Simple/Composite Indexes
  • A matching index scan:
    • implemented in DB2/MVS using a B-tree index.
    • based on a single index
    • rows retrieved from a table based on the condition(s) specified in the WHERE clause of the SELECT statement.
simple indexes

RELATIONAL DBs

Table Access and Joins

Simple Indexes
  • The access plan for a SELECT statement may use indexes to limit the number of rows searched in a database. For example:

If ZipIdx exists on ZipCode in the Customers table,

SELECT * FROM Customers WHERE ZipCode = 56001

implemented in DB2/MVS using a matching index scan.

  • The plan for this query takes one step and is designated by the following DB2/MVS access plan:

ACCESSTYPE = I for an index scan

ACCESSNAME = ZipIdx

MATCHCOLS = 1

composite indexes

RELATIONAL DBs

Table Access and Joins

Composite Indexes
  • A matching index scan retrieves records from a table when the column components of the index can be matched to the predicates of the WHERE clause. For example:

If MailIdx exists on ZipCode + IncomeLevel + MaritalStatus

SELECT Name, Address FROM Customers

WHERE ZipCode = 56001 AND IncomeLevel = 10

implemented in DB2/MVS using a matching index scan.

  • The DB2 access plan will have:

ACCESSTYPE = I for an index scan

ACCESSNAME = MailIdx

MATCHCOLS = 2

multiple indexes

RELATIONAL DBs

Table Access and Joins

Multiple Indexes
  • Multiple index access used when different indexes on the predicates in the WHERE clause. For example:

SELECT Name, Address FROM Customers

WHERE ZipCode = 56001

AND (IncomeLevel = 10 OR MaritalStatus = “M”)

DB2 Access Plan assuming ZipIdx, IncomeIdx, and MaritalIdx :

TNAME MATCHCOLS PREFETCH

ACCESSTYPE ACCESSNAME MIXOPSEQ

Customers M 0 L 0

Customers MX 1 IncomeIdx S 1

Customers MX 1 MaritalIdx S 2

Customers MU 0 3

Customers MX 1 ZipIdx S 4

Customers MI 0 5

overview1

RELATIONAL DBs

Table Access and Joins

Overview
  • Common methods to join tables
    • Nested Loop
    • Sort Merge
    • Hash
  • In DB2, multi-step access plans for join processing make use of columns called:
    • PLANNO(1, 2, …) to indicate which step
    • METHOD(1=Nested Loop, 2=Sort Merge, 4=Hybrid Join) to indicate the join technique chosen by the query optimizer
    • TABNO(such as 1 or 2) to indicate whether we are extracting rows from the first or second table of the join
    • SORTN_JOIN(Y or N) to indicate whether a sort is required (for example, for the sort merge join)
overview2

RELATIONAL DBs

Table Access and Joins

Overview
  • A join of two tables occurs in two steps
    • where one table becomes the outer table
    • and the other becomes the inner table.
  • During join processing
    • records from outer table are presented one-by-one to the inner table
    • and the inner table searched for records matching the one presented to it.
nested loop joins

RELATIONAL DBs

Table Access and Joins

Nested Loop Joins
  • Process:
    • Records of the outer table are retrieved (or presented) using a table scan (or indexing, if possible).
      • Only candidate records are retrieved.
      • Candidate records satisfy local predicates.
    • For each retrieved row from the outer table:
      • the inner table is searched for qualifying records
      • & the results merged into a new record in an output table.

SELECT CustomerID, Name, Street, City, State, Zip, TotalSales

FROM Customers, Sales

WHERE State = “MN”

AND Customers.CustomerID = Sales.CustomerID

nested loop joins1

RELATIONAL DBs

Table Access and Joins

Nested Loop Joins
  • Appropriate:
    • When the outer table has only a few records (after applying predicates)
    • Where the inner table is small or has an index usable to access qualifying records
  • Drawbacks when the inner and outer tables are not indexed or clustered on the same values:
    • The inner table (including indexes) may be scanned repetitively to find matching records.
    • The outer table is processed inefficiently for records with the same value in the join columns.
sort merge joins

RELATIONAL DBs

Table Access and Joins

Sort-Merge Joins
  • Algorithm where the 2 tables are scanned only once:
    • For each table, a temporary table is created of qualifying candidate records.
    • Each temporary table is sorted on the same columns based on the joining predicates (conditions).
    • Finally the two temporary tables are merged and joined (as an inner and outer table) into a third table.
  • If either of the original two tables has an index on the selection and joining predicates, it may be possible to skip the creation of the corresponding temporary table.
query optimization and indexes3

Query Optimization and Indexes

Introduction

Relational Databases

DB2/400

db 400

QUERY OPTIMIATION AND INDEXES

DB/400

Components

Research Problem/Objectives

Data Management Methods

Proposed Solution

The Optimizer

Proposed Scope

The Database Monitor

Summary

components

DB/400

Components
  • Query component

-> Query optimizer

      • Cost based
  • Data management methods
    • Access paths
    • Access methods
data management methods

DB/400

Data Management Methods
  • Access Paths
    • Sequential, also called arrival sequence
      • accesses data in physical order
    • Keyed sequential, uses indexes
      • accesses data in the order of the index
  • Access Methods
    • Dataspace scan - similar to table scan - parallel version too
    • Key selection - requires index - parallel version too
    • Key positioning - requires index - parallel version too
    • Index only - requires index - SMP required
    • Index-from-Index - requires index - SMP required
    • Hashing - SMP required
the optimizer

DB/400

The Optimizer
  • Query optimization is a tradeoff between
    • the time to determine an optimal implementation
    • the time to actually execute the query
  • The query optimizer
    • selects the most efficient access method at query run-time
      • identifies alternatives
      • estimates current costs
    • optimizes joins and grouping operations
the optimizer1

DB/400

The Optimizer
  • The access cost is modeled for:
    • reading records without an index

(a dataspace scan)

    • reading records with an index

(key selection or key positioning)*

    • creating a temporary index on the relevant data
    • creating a temporary index on another index

(index-from-index)

    • using the hashing method or a query sort routine

*Each index is evaluated, unless a time limit is reached first. Indexes are examined in LIFO order.

the optimizer2

DB/400

The Optimizer
  • The cost of a particular method is the sum of:
    • the start-up cost
    • the cost associated with the optimization mode**
    • the cost of creating any indexes
    • the cost for the expected number of page faults to read the data and the cost to process the expected number of rows

**The optimization mode is given by a parameter, which defines the minimization goal as either the time to retrieve the first buffer of data or the time to retrieve all the rows that would be selected.

the database monitor

DB/400

The Database Monitor
  • The database monitor collects statistics and data about query implementation and performance during query optimization and execution.
  • For example, SQL can extract data showing:
    • Queries implemented as table scans
    • Queries taking the most time
    • Queries stopped by the query governor
    • All queries executed and data on each query, such as
      • Table names
      • Number of rows in a table and the number of rows selected
      • Estimated & actual execution time
      • Indexed advised & fields
research problem objectives

DB/400

Research Problem/Objectives

RESEARCH PROBLEM

Improve query performance

by determining an appropriate set of indexes.

RESEARCH OBJECTIVES

  • Determine the information required & available.
  • Determine the formulas/algorithms

needed to evaluate the data.

  • Determine the user interface

(including capabilities)

of an user-friendly intelligent index advisor.

proposed scope

DB/400

Proposed Scope
  • Create a prototype intelligent index manager.
    • Experiment with some basic queries.
  • Evaluate simple queries
    • no joins & no GROUP BY or ORDER BY clauses in SQL.
summary

DB/400

Summary
  • IBM’s DB2/400 has several components:
    • The query component, including the query optimizer which:
      • chooses an minimal cost method to implement queries
      • validating or creating an access plan
        • a control structure containing the implementation data.
    • The data management methods:
      • algorithms for retrieving data through an access path using a table access method.
  • The database monitor:
    • collects statistics and implementation details about queries.
    • includes recommendations on creating indexes:
      • on a per query basis
      • based on a limited subset of the query optimizer rules.
summary1

DB/400

Summary
  • An intelligent index manager would:
    • need data on query implementation and execution from the database monitor.
      • The database monitor would need to be enhanced to save additional data from the query optimizer
      • Supplemental data may be needed such as
        • How often the records change in queried tables
        • User specifications
    • need algorithms, formulas & rules based on
      • The major ones used by the query optimizer.
      • Additional criteria, such as projected index maintenance costs & rules based on the user specifications
summary2

DB/400

Summary
  • An intelligent index manager would:
    • evaluate & judge if new indexes are needed; and if so, what indexes would provide the best overall system performance.
      • Decisions would be made on the makeup of the indexes: which tables and which fields.
      • Decisions would be based on all the queries within its scope
    • advise users recommending specific indexes
  • Further research is needed to resolve questionsconcerning exact implementation details:
    • specific data requirements, algorithms and formulas
    • an appropriate user-interface
query optimization and indexes4

Query Optimization and Indexes

Thank you

Mark Schoennauer

BACKGROUND RESEARCH