Query processing and optimization
This presentation is the property of its rightful owner.
Sponsored Links
1 / 29

Query Processing and Optimization PowerPoint PPT Presentation


  • 172 Views
  • Uploaded on
  • Presentation posted in: General

Query Processing and Optimization. General Overview. Relational model - SQL Formal & commercial query languages Functional Dependencies Normalization Physical Design Indexing Query Processing and Optimization. Review: QP & O. SQL Query. Query Processor. Parser. Query Optimizer.

Download Presentation

Query Processing and Optimization

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Query processing and optimization

Query Processing and Optimization


General overview

General Overview

  • Relational model - SQL

    • Formal & commercial query languages

  • Functional Dependencies

  • Normalization

  • Physical Design

  • Indexing

  • Query Processing and Optimization


Review qp o

Review: QP & O

SQL Query

Query

Processor

Parser

Query Optimizer

Algebraic

Expression

Execution plan

Evaluator

Data: result of the query


Review qp o1

Review: QP & O

Query

Optimizer

Algebraic

Representation

Query Rewriter

Algebraic Representation

Data Stats

Plan Generator

Query Execution Plan


Review plan generation

Review-Plan Generation

Metadata: DBMS maintains statistics about each relation, attribute and index.

Plan generation:

  • Generate many alternative plans

    • We saw many for selections, joins

  • Estimate cost for each and choose best

    Plans examined:

    Selection (exact match): Linear, binary, PI, SI

    Range: PI, SI

    Joins: NLJ, BNLJ, INLJ, SMJ, HJ


Review plan generation1

Review-Plan Generation

Depends upon a cost model

For any query, must know

its estimated cardinality

its estimated cost (in # of I/Os)

E.g.: A = K (R )

 cardinality SC(A, R)

 cost: depends on the plan, attribute:

Linear Scan bR /2 bR

Binary Searchlog2(bR) log2(bR)+sc(A, R)/fR -1

PI ScanHTi +1 HTi +sc(A, R) / fR


Cost model

Cost Model

How do we predict the cost of a plan?

Ans: Cost model

  • For each plan operator and each algorithm we have a cost formula

  • Inputs to formulas depend on relations, attributes

  • Database maintains statistics about relations for this (Metadata)


Metadata

Metadata

  • Given a relation r, DBMS likely maintains the following metadata:

  • Size (# of tuples) nr

  • Size (# of blocks) br

  • Block size (#tuples) fr

    (typicallybr =nr / fr )

  • Tuple size (in bytes) sr

  • Attribute Variance (for each attribute r, # of different values) V(att, r)

  • Selection Cardinality (for each attribute in r, expected size of a selection: att = K (r ) ) SC(att, r)


Cardinality of joins in general

Cardinality of Joins in General

Assume join: R S

  • If R, S have no common attributes: nr*ns

  • If R,S have attribute A in common:

    (take min)

  • If R, S have attribute A in common and:

    • A is a candidate key for R: ≤ ns

    • A is candidate key in R and candidate key in S : ≤ min(nr, ns)

    • A is a key for R, foreign key for S: = ns


Join operation

Join Operation

  • Size and plans for join operation

  • Running example: depositor customer

    Metadata:

    ncustomer = 10,000 ndepositor = 5000

    fcustomer = 25 fdepositor = 50

    bcustomer= 400 bdepositor= 100

    V(cname, depositor) = 2500 (each customer has on average 2 accts)

    cname in depositor is foreign key

depositor(cname, acct_no)

customer(cname, cstreet, ccity)


Nested loop join

Nested-Loop Join

Query: R S

Algorithm 1: Nested Loop Join

Idea:

t1

u1

Blocks

of...

t2

u2

t3

u3

R

S

results

Compare: (t1, u1), (t1, u2), (t1, u3) .....

Then: GET NEXT BLOCK OF S

Repeat: for EVERY tuple of R


Join algorithms

Join Algorithms

Query: R S

Algorithm 2: Block Nested Loop Join

Idea:

t1

u1

Blocks

of...

t2

u2

t3

u3

R

S

results

Compare: (t1, u1), (t1, u2), (t1, u3)

(t2, u1), (t2, u2), (t2, u3)

(t3, u1), (t3, u2), (t3, u3)

Then: GET NEXT BLOCK OF S

Repeat: for EVERY BLOCK of R


Block nested loop join

Block Nested-Loop Join

  • Block Nested Loop Join

    for each block BRofR dofor each block BSof S dofor each tuple trin BR dofor each tuple usin Bsdo beginCheck if (tr,us) satisfy the join condition if they do (“match”), add tr• usto the result.


Block nested loop join cont

Block Nested-Loop Join (Cont.)

Cost:

  • Worst case estimate: br bs + br block accesses.

  • Improvements to nested loop and block nested loop algorithms for a buffer with M blocks:

    • In block nested-loop, use M — 2 disk blocks as blocking unit for outer relations, where M = memory size in blocks; use remaining two blocks to buffer inner relation and output

      • Cost = br / (M-2)  bs + br

    • If equi-join attribute forms a key on inner relation, stop inner loop on first match

    • Scan inner loop forward and backward alternately, to make use of the blocks remaining in buffer (with LRU replacement)


Join algorithms1

Join Algorithms

Query: R S

Algorithm 3: Indexed Nested Loop Join

Idea:

t1

Blocks

of...

t2

t3

R

S

results

(fill w/

blocks of

S or index blocks)

For each tuple ti of R

if ti.A = K (A is the attribute R,S have in common)

then use the index to compute att = K (S )

Demands: index on A for S


Indexed nested loop join

Indexed Nested-Loop Join

Indexed Nested Loop Join

  • For each tuple tRin the outer relation R, use the index to look up tuples in S that satisfy the join condition with tuple tR.

  • Worst case: buffer has space for only one page of R, and, for each tuple in R, we perform an index lookup on s.

  • Cost of the join: br + nr c

    • Where c is the cost of traversing the index and fetching all matching s tuples for one tuple from r

    • c can be estimated as cost of a single selection on s using the join condition.

  • If indices are available on join attributes of both R and S,use the relation with fewer tuples as the outer relation.


Example of nested loop join costs

Example of Nested-Loop Join Costs

Query: depositor customer

(cname, acct_no) (cname, ccity, cstreet)

Metadata:

customer: ncustomer = 10,000

fcustomer = 25 bcustomer = 400

depositor: ndepositor = 5000

fdepositor = 50 bdepositor = 100

V (cname, depositor) = 2500

i a primary index on cname (dense) for customer (fi = 20)

Minimal buffer


Plan generation for joins

Plan generation for Joins

Alternative 1: Block Nested Loop

1a: customer = OUTER relation

depositor = INNER relation

cost: bcustomer + bdepositor * bcustomer = 400 +(400 *100) = 40,400

1b: customer = INNER relation

depositor = OUTER relation

cost: bdepositor + bdepositor * bcustomer = 100 +(400 *100) = 40,100


Plan generation for joins1

Plan generation for Joins

Alternative 2: Indexed Nested Loop

We have index on cname for customer. Depositor is the outer relation

Cost:

bdepositor + ndepositor * c = 100 +(5000 *c ) , c is the cost of evaluating a selection cname=K using index.

What is c? Primary index on cname, cname a key for customer

c = HTi +1


Plan generation for joins2

Plan generation for Joins

What is HTi ?

cname a key for customer. V(cname, customer) = 10,000

fi = 20, i is dense

LBi =  10,000/20 = 500

HTi ~ logfi(LBi) + 1 = log20 500 + 1 = 4

Cost of index nested loop is:

= 100 + (5000 * (4+1)) = 25,100 BA (cheaper than NLJ)


Another join strategy

pR

pS

Another Join Strategy

Query: R S

Algorithm: Merge Join

Idea: suppose R, S are both sorted on A (A is the common attribute)

A

A

2

2

3

5

1

2

3

4

...

...

Compare:

(1, 2) advance pR

(2, 2) match, advance pS  add to result

(2, 2) match, advance pS  add to result

(2, 3) advance pR

(3, 3) match, advance pS add to result

(3, 5) advance pR

(4, 5) read next block of R


Merge join

Merge-Join

GIVEN R, S both sorted on A

  • Initialization

    • Reserve blocks of R, S into buffer reserving one block for result

    • Pr= 1, Ps =1

  • Join (assuming no duplicate values on A in R)

    WHILE !EOF( R) && !EOF(S) DO

    if BR[Pr].A == BS[Ps].A then

    output to result; Ps++

    else if BR[Pr].A < BS[Ps].A then

    Pr++

    else (same for Ps)

    if Pr or Ps point past end of block,

    read next block and set Pr(Ps) to 1


Merge join cont

Merge-Join (Cont.)

  • Each block needs to be read only once (assuming all tuples for any given value of the join attributes fit in memory)

  • Thus number of block accesses for merge-join is bR + bS

  • But....

    What if one/both of R,S not sorted on A?

    Ans: May be worth sorting first and then perform merge join (Sort-Merge Join)

    Cost: bR + bS + sortR + sortS


External sorting

External Sorting

Not the same as internal sorting

Internal sorting:

 minimize CPU (count comparisons)

 best: quicksort, mergesort, ....

External sorting:

 minimize disk accesses (what we ‘re sorting doesn’t fit in memory!)

 best: external merge sort

WHEN used?

1) SORT-MERGE join

2) ORDER BY queries

3) SELECT DISTINCT (duplicate elimination)


External sorting1

d

e

g

m

p

r

31

16

24

3

2

16

External Sorting

Idea:

1. Sort fragments of file in memory using internal sort (runs). Store runs on disk.

2. Merge runs. E.g.:

a

b

c

19

14

33

a

d

g

19

31

24

sort

a

a

b

c

d

d

d

e

g

m

p

r

g

a

d

c

b

e

r

d

m

p

d

a

14

19

14

33

7

21

31

16

24

3

2

16

24

19

31

33

14

16

16

21

3

2

7

14

merge

sort

b

c

e

14

33

16

merge

sort

a

d

d

14

7

21

d

m

r

21

3

16

sort

merge

a

d

p

14

7

2


External sorting cont

External Sorting (cont.)

Algorithm

Let M = size of buffer (in blocks)

1. Sort runs of size M blocks each (except for last) and store. Use internal sort on each run.

2. Merge M-1 runs at a time into 1 and store. Merge for all runs.

3. if step 2 results in more than 1 run, goto step 2.

Run

m-1

Output

Run 1

Run 2

........

Run 3


External sorting cont1

External Sorting (cont.)

Cost: 2 bR * (logM-1(bR / M) + 1)

Intuition:

Step 1: create runs

 every block read and written once

 cost 2 bR I/Os

Step 2: Merge

 every merge iteration requires reading and writing entire file (2 bR I/Os)

Total:

logM-1(bR / M)

Iteration #

---------------

1

2

3

.....

Runs Left to Merge

----------------------------


What if we need to sort

What if we need to sort?

Query: depositor customer

Merge-sort Join

Sorting depositor:

bdepositor = 100

Sort depositor = 2 * 100 * (log2(100 / 3) + 1)

= 1400

Similarly, for customer we get 7200 I/Os.

Total: 100 + 400 + 1400 + 7200 = 9100 I/O’s!

Still beats BNLJ (40K), INLJ (25K)

Why not use SMJ always?

Ans: 1) Sometimes inner relation can fit in memory

2) Sometimes index is small

3) SMJ only work for natural joins, “equijoins”


Hybrid merge join

Hybrid Merge Join

  • hybrid merge-join: If one relation is sorted, and the other has a secondary B+-tree index on the join attribute

    • Merge the sorted relation with the leaf entries of the B+-tree .

    • Sort the result on the addresses of the unsorted relation’s tuples

    • Scan the unsorted relation in physical address order and merge with previous result, to replace addresses by the actual tuples

      • Sequential scan more efficient than random lookup


  • Login