Query processing and optimization
Sponsored Links
This presentation is the property of its rightful owner.
1 / 29

Query Processing and Optimization PowerPoint PPT Presentation


  • 197 Views
  • Uploaded on
  • Presentation posted in: General

Query Processing and Optimization. General Overview. Relational model - SQL Formal & commercial query languages Functional Dependencies Normalization Physical Design Indexing Query Processing and Optimization. Review: QP & O. SQL Query. Query Processor. Parser. Query Optimizer.

Download Presentation

Query Processing and Optimization

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Query Processing and Optimization


General Overview

  • Relational model - SQL

    • Formal & commercial query languages

  • Functional Dependencies

  • Normalization

  • Physical Design

  • Indexing

  • Query Processing and Optimization


Review: QP & O

SQL Query

Query

Processor

Parser

Query Optimizer

Algebraic

Expression

Execution plan

Evaluator

Data: result of the query


Review: QP & O

Query

Optimizer

Algebraic

Representation

Query Rewriter

Algebraic Representation

Data Stats

Plan Generator

Query Execution Plan


Review-Plan Generation

Metadata: DBMS maintains statistics about each relation, attribute and index.

Plan generation:

  • Generate many alternative plans

    • We saw many for selections, joins

  • Estimate cost for each and choose best

    Plans examined:

    Selection (exact match): Linear, binary, PI, SI

    Range: PI, SI

    Joins: NLJ, BNLJ, INLJ, SMJ, HJ


Review-Plan Generation

Depends upon a cost model

For any query, must know

its estimated cardinality

its estimated cost (in # of I/Os)

E.g.: A = K (R )

 cardinality SC(A, R)

 cost: depends on the plan, attribute:

Linear Scan bR /2 bR

Binary Searchlog2(bR) log2(bR)+sc(A, R)/fR -1

PI ScanHTi +1 HTi +sc(A, R) / fR


Cost Model

How do we predict the cost of a plan?

Ans: Cost model

  • For each plan operator and each algorithm we have a cost formula

  • Inputs to formulas depend on relations, attributes

  • Database maintains statistics about relations for this (Metadata)


Metadata

  • Given a relation r, DBMS likely maintains the following metadata:

  • Size (# of tuples) nr

  • Size (# of blocks) br

  • Block size (#tuples) fr

    (typicallybr =nr / fr )

  • Tuple size (in bytes) sr

  • Attribute Variance (for each attribute r, # of different values) V(att, r)

  • Selection Cardinality (for each attribute in r, expected size of a selection: att = K (r ) ) SC(att, r)


Cardinality of Joins in General

Assume join: R S

  • If R, S have no common attributes: nr*ns

  • If R,S have attribute A in common:

    (take min)

  • If R, S have attribute A in common and:

    • A is a candidate key for R: ≤ ns

    • A is candidate key in R and candidate key in S : ≤ min(nr, ns)

    • A is a key for R, foreign key for S: = ns


Join Operation

  • Size and plans for join operation

  • Running example: depositor customer

    Metadata:

    ncustomer = 10,000 ndepositor = 5000

    fcustomer = 25 fdepositor = 50

    bcustomer= 400 bdepositor= 100

    V(cname, depositor) = 2500 (each customer has on average 2 accts)

    cname in depositor is foreign key

depositor(cname, acct_no)

customer(cname, cstreet, ccity)


Nested-Loop Join

Query: R S

Algorithm 1: Nested Loop Join

Idea:

t1

u1

Blocks

of...

t2

u2

t3

u3

R

S

results

Compare: (t1, u1), (t1, u2), (t1, u3) .....

Then: GET NEXT BLOCK OF S

Repeat: for EVERY tuple of R


Join Algorithms

Query: R S

Algorithm 2: Block Nested Loop Join

Idea:

t1

u1

Blocks

of...

t2

u2

t3

u3

R

S

results

Compare: (t1, u1), (t1, u2), (t1, u3)

(t2, u1), (t2, u2), (t2, u3)

(t3, u1), (t3, u2), (t3, u3)

Then: GET NEXT BLOCK OF S

Repeat: for EVERY BLOCK of R


Block Nested-Loop Join

  • Block Nested Loop Join

    for each block BRofR dofor each block BSof S dofor each tuple trin BR dofor each tuple usin Bsdo beginCheck if (tr,us) satisfy the join condition if they do (“match”), add tr• usto the result.


Block Nested-Loop Join (Cont.)

Cost:

  • Worst case estimate: br bs + br block accesses.

  • Improvements to nested loop and block nested loop algorithms for a buffer with M blocks:

    • In block nested-loop, use M — 2 disk blocks as blocking unit for outer relations, where M = memory size in blocks; use remaining two blocks to buffer inner relation and output

      • Cost = br / (M-2)  bs + br

    • If equi-join attribute forms a key on inner relation, stop inner loop on first match

    • Scan inner loop forward and backward alternately, to make use of the blocks remaining in buffer (with LRU replacement)


Join Algorithms

Query: R S

Algorithm 3: Indexed Nested Loop Join

Idea:

t1

Blocks

of...

t2

t3

R

S

results

(fill w/

blocks of

S or index blocks)

For each tuple ti of R

if ti.A = K (A is the attribute R,S have in common)

then use the index to compute att = K (S )

Demands: index on A for S


Indexed Nested-Loop Join

Indexed Nested Loop Join

  • For each tuple tRin the outer relation R, use the index to look up tuples in S that satisfy the join condition with tuple tR.

  • Worst case: buffer has space for only one page of R, and, for each tuple in R, we perform an index lookup on s.

  • Cost of the join: br + nr c

    • Where c is the cost of traversing the index and fetching all matching s tuples for one tuple from r

    • c can be estimated as cost of a single selection on s using the join condition.

  • If indices are available on join attributes of both R and S,use the relation with fewer tuples as the outer relation.


Example of Nested-Loop Join Costs

Query: depositor customer

(cname, acct_no) (cname, ccity, cstreet)

Metadata:

customer: ncustomer = 10,000

fcustomer = 25 bcustomer = 400

depositor: ndepositor = 5000

fdepositor = 50 bdepositor = 100

V (cname, depositor) = 2500

i a primary index on cname (dense) for customer (fi = 20)

Minimal buffer


Plan generation for Joins

Alternative 1: Block Nested Loop

1a: customer = OUTER relation

depositor = INNER relation

cost: bcustomer + bdepositor * bcustomer = 400 +(400 *100) = 40,400

1b: customer = INNER relation

depositor = OUTER relation

cost: bdepositor + bdepositor * bcustomer = 100 +(400 *100) = 40,100


Plan generation for Joins

Alternative 2: Indexed Nested Loop

We have index on cname for customer. Depositor is the outer relation

Cost:

bdepositor + ndepositor * c = 100 +(5000 *c ) , c is the cost of evaluating a selection cname=K using index.

What is c? Primary index on cname, cname a key for customer

c = HTi +1


Plan generation for Joins

What is HTi ?

cname a key for customer. V(cname, customer) = 10,000

fi = 20, i is dense

LBi =  10,000/20 = 500

HTi ~ logfi(LBi) + 1 = log20 500 + 1 = 4

Cost of index nested loop is:

= 100 + (5000 * (4+1)) = 25,100 BA (cheaper than NLJ)


pR

pS

Another Join Strategy

Query: R S

Algorithm: Merge Join

Idea: suppose R, S are both sorted on A (A is the common attribute)

A

A

2

2

3

5

1

2

3

4

...

...

Compare:

(1, 2) advance pR

(2, 2) match, advance pS  add to result

(2, 2) match, advance pS  add to result

(2, 3) advance pR

(3, 3) match, advance pS add to result

(3, 5) advance pR

(4, 5) read next block of R


Merge-Join

GIVEN R, S both sorted on A

  • Initialization

    • Reserve blocks of R, S into buffer reserving one block for result

    • Pr= 1, Ps =1

  • Join (assuming no duplicate values on A in R)

    WHILE !EOF( R) && !EOF(S) DO

    if BR[Pr].A == BS[Ps].A then

    output to result; Ps++

    else if BR[Pr].A < BS[Ps].A then

    Pr++

    else (same for Ps)

    if Pr or Ps point past end of block,

    read next block and set Pr(Ps) to 1


Merge-Join (Cont.)

  • Each block needs to be read only once (assuming all tuples for any given value of the join attributes fit in memory)

  • Thus number of block accesses for merge-join is bR + bS

  • But....

    What if one/both of R,S not sorted on A?

    Ans: May be worth sorting first and then perform merge join (Sort-Merge Join)

    Cost: bR + bS + sortR + sortS


External Sorting

Not the same as internal sorting

Internal sorting:

 minimize CPU (count comparisons)

 best: quicksort, mergesort, ....

External sorting:

 minimize disk accesses (what we ‘re sorting doesn’t fit in memory!)

 best: external merge sort

WHEN used?

1) SORT-MERGE join

2) ORDER BY queries

3) SELECT DISTINCT (duplicate elimination)


d

e

g

m

p

r

31

16

24

3

2

16

External Sorting

Idea:

1. Sort fragments of file in memory using internal sort (runs). Store runs on disk.

2. Merge runs. E.g.:

a

b

c

19

14

33

a

d

g

19

31

24

sort

a

a

b

c

d

d

d

e

g

m

p

r

g

a

d

c

b

e

r

d

m

p

d

a

14

19

14

33

7

21

31

16

24

3

2

16

24

19

31

33

14

16

16

21

3

2

7

14

merge

sort

b

c

e

14

33

16

merge

sort

a

d

d

14

7

21

d

m

r

21

3

16

sort

merge

a

d

p

14

7

2


External Sorting (cont.)

Algorithm

Let M = size of buffer (in blocks)

1. Sort runs of size M blocks each (except for last) and store. Use internal sort on each run.

2. Merge M-1 runs at a time into 1 and store. Merge for all runs.

3. if step 2 results in more than 1 run, goto step 2.

Run

m-1

Output

Run 1

Run 2

........

Run 3


External Sorting (cont.)

Cost: 2 bR * (logM-1(bR / M) + 1)

Intuition:

Step 1: create runs

 every block read and written once

 cost 2 bR I/Os

Step 2: Merge

 every merge iteration requires reading and writing entire file (2 bR I/Os)

Total:

logM-1(bR / M)

Iteration #

---------------

1

2

3

.....

Runs Left to Merge

----------------------------


What if we need to sort?

Query: depositor customer

Merge-sort Join

Sorting depositor:

bdepositor = 100

Sort depositor = 2 * 100 * (log2(100 / 3) + 1)

= 1400

Similarly, for customer we get 7200 I/Os.

Total: 100 + 400 + 1400 + 7200 = 9100 I/O’s!

Still beats BNLJ (40K), INLJ (25K)

Why not use SMJ always?

Ans: 1) Sometimes inner relation can fit in memory

2) Sometimes index is small

3) SMJ only work for natural joins, “equijoins”


Hybrid Merge Join

  • hybrid merge-join: If one relation is sorted, and the other has a secondary B+-tree index on the join attribute

    • Merge the sorted relation with the leaf entries of the B+-tree .

    • Sort the result on the addresses of the unsorted relation’s tuples

    • Scan the unsorted relation in physical address order and merge with previous result, to replace addresses by the actual tuples

      • Sequential scan more efficient than random lookup


  • Login