1 / 29

# Query Processing and Optimization - PowerPoint PPT Presentation

Query Processing and Optimization. General Overview. Relational model - SQL Formal & commercial query languages Functional Dependencies Normalization Physical Design Indexing Query Processing and Optimization. Review: QP & O. SQL Query. Query Processor. Parser. Query Optimizer.

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

## PowerPoint Slideshow about ' Query Processing and Optimization' - meara

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

### Query Processing and Optimization

• Relational model - SQL

• Formal & commercial query languages

• Functional Dependencies

• Normalization

• Physical Design

• Indexing

• Query Processing and Optimization

SQL Query

Query

Processor

Parser

Query Optimizer

Algebraic

Expression

Execution plan

Evaluator

Data: result of the query

Query

Optimizer

Algebraic

Representation

Query Rewriter

Algebraic Representation

Data Stats

Plan Generator

Query Execution Plan

Metadata: DBMS maintains statistics about each relation, attribute and index.

Plan generation:

• Generate many alternative plans

• We saw many for selections, joins

• Estimate cost for each and choose best

Plans examined:

Selection (exact match): Linear, binary, PI, SI

Range: PI, SI

Joins: NLJ, BNLJ, INLJ, SMJ, HJ

Depends upon a cost model

For any query, must know

its estimated cardinality

its estimated cost (in # of I/Os)

E.g.: A = K (R )

 cardinality SC(A, R)

 cost: depends on the plan, attribute:

Linear Scan bR /2 bR

Binary Search log2(bR) log2(bR)+sc(A, R)/fR -1

PI Scan HTi +1 HTi +sc(A, R) / fR

How do we predict the cost of a plan?

Ans: Cost model

• For each plan operator and each algorithm we have a cost formula

• Inputs to formulas depend on relations, attributes

• Database maintains statistics about relations for this (Metadata)

• Given a relation r, DBMS likely maintains the following metadata:

• Size (# of tuples) nr

• Size (# of blocks) br

• Block size (#tuples) fr

(typicallybr =nr / fr )

• Tuple size (in bytes) sr

• Attribute Variance (for each attribute r, # of different values) V(att, r)

• Selection Cardinality (for each attribute in r, expected size of a selection: att = K (r ) ) SC(att, r)

Assume join: R S

• If R, S have no common attributes: nr*ns

• If R,S have attribute A in common:

(take min)

• If R, S have attribute A in common and:

• A is a candidate key for R: ≤ ns

• A is candidate key in R and candidate key in S : ≤ min(nr, ns)

• A is a key for R, foreign key for S: = ns

• Size and plans for join operation

• Running example: depositor customer

Metadata:

ncustomer = 10,000 ndepositor = 5000

fcustomer = 25 fdepositor = 50

bcustomer= 400 bdepositor= 100

V(cname, depositor) = 2500 (each customer has on average 2 accts)

cname in depositor is foreign key

depositor(cname, acct_no)

customer(cname, cstreet, ccity)

Query: R S

Algorithm 1: Nested Loop Join

Idea:

t1

u1

Blocks

of...

t2

u2

t3

u3

R

S

results

Compare: (t1, u1), (t1, u2), (t1, u3) .....

Then: GET NEXT BLOCK OF S

Repeat: for EVERY tuple of R

Query: R S

Algorithm 2: Block Nested Loop Join

Idea:

t1

u1

Blocks

of...

t2

u2

t3

u3

R

S

results

Compare: (t1, u1), (t1, u2), (t1, u3)

(t2, u1), (t2, u2), (t2, u3)

(t3, u1), (t3, u2), (t3, u3)

Then: GET NEXT BLOCK OF S

Repeat: for EVERY BLOCK of R

• Block Nested Loop Join

for each block BRofR dofor each block BSof S do for each tuple trin BR do for each tuple usin Bsdo beginCheck if (tr,us) satisfy the join condition if they do (“match”), add tr• usto the result.

Cost:

• Worst case estimate: br bs + br block accesses.

• Improvements to nested loop and block nested loop algorithms for a buffer with M blocks:

• In block nested-loop, use M — 2 disk blocks as blocking unit for outer relations, where M = memory size in blocks; use remaining two blocks to buffer inner relation and output

• Cost = br / (M-2)  bs + br

• If equi-join attribute forms a key on inner relation, stop inner loop on first match

• Scan inner loop forward and backward alternately, to make use of the blocks remaining in buffer (with LRU replacement)

Query: R S

Algorithm 3: Indexed Nested Loop Join

Idea:

t1

Blocks

of...

t2

t3

R

S

results

(fill w/

blocks of

S or index blocks)

For each tuple ti of R

if ti.A = K (A is the attribute R,S have in common)

then use the index to compute att = K (S )

Demands: index on A for S

Indexed Nested Loop Join

• For each tuple tRin the outer relation R, use the index to look up tuples in S that satisfy the join condition with tuple tR.

• Worst case: buffer has space for only one page of R, and, for each tuple in R, we perform an index lookup on s.

• Cost of the join: br + nr c

• Where c is the cost of traversing the index and fetching all matching s tuples for one tuple from r

• c can be estimated as cost of a single selection on s using the join condition.

• If indices are available on join attributes of both R and S,use the relation with fewer tuples as the outer relation.

Query: depositor customer

(cname, acct_no) (cname, ccity, cstreet)

Metadata:

customer: ncustomer = 10,000

fcustomer = 25 bcustomer = 400

depositor: ndepositor = 5000

fdepositor = 50 bdepositor = 100

V (cname, depositor) = 2500

i a primary index on cname (dense) for customer (fi = 20)

Minimal buffer

Alternative 1: Block Nested Loop

1a: customer = OUTER relation

depositor = INNER relation

cost: bcustomer + bdepositor * bcustomer = 400 +(400 *100) = 40,400

1b: customer = INNER relation

depositor = OUTER relation

cost: bdepositor + bdepositor * bcustomer = 100 +(400 *100) = 40,100

Alternative 2: Indexed Nested Loop

We have index on cname for customer. Depositor is the outer relation

Cost:

bdepositor + ndepositor * c = 100 +(5000 *c ) , c is the cost of evaluating a selection cname=K using index.

What is c? Primary index on cname, cname a key for customer

c = HTi +1

What is HTi ?

cname a key for customer. V(cname, customer) = 10,000

fi = 20, i is dense

LBi =  10,000/20 = 500

HTi ~ logfi(LBi) + 1 = log20 500 + 1 = 4

Cost of index nested loop is:

= 100 + (5000 * (4+1)) = 25,100 BA (cheaper than NLJ)

pR

pS

Another Join Strategy

Query: R S

Algorithm: Merge Join

Idea: suppose R, S are both sorted on A (A is the common attribute)

A

A

2

2

3

5

1

2

3

4

...

...

Compare:

(1, 2) advance pR

(2, 2) match, advance pS  add to result

(2, 2) match, advance pS  add to result

(2, 3) advance pR

(3, 3) match, advance pS add to result

(3, 5) advance pR

(4, 5) read next block of R

GIVEN R, S both sorted on A

• Initialization

• Reserve blocks of R, S into buffer reserving one block for result

• Pr= 1, Ps =1

• Join (assuming no duplicate values on A in R)

WHILE !EOF( R) && !EOF(S) DO

if BR[Pr].A == BS[Ps].A then

output to result; Ps++

else if BR[Pr].A < BS[Ps].A then

Pr++

else (same for Ps)

if Pr or Ps point past end of block,

read next block and set Pr(Ps) to 1

• Each block needs to be read only once (assuming all tuples for any given value of the join attributes fit in memory)

• Thus number of block accesses for merge-join is bR + bS

• But....

What if one/both of R,S not sorted on A?

Ans: May be worth sorting first and then perform merge join (Sort-Merge Join)

Cost: bR + bS + sortR + sortS

Not the same as internal sorting

Internal sorting:

 minimize CPU (count comparisons)

 best: quicksort, mergesort, ....

External sorting:

 minimize disk accesses (what we ‘re sorting doesn’t fit in memory!)

 best: external merge sort

WHEN used?

1) SORT-MERGE join

2) ORDER BY queries

3) SELECT DISTINCT (duplicate elimination)

e

g

m

p

r

31

16

24

3

2

16

External Sorting

Idea:

1. Sort fragments of file in memory using internal sort (runs). Store runs on disk.

2. Merge runs. E.g.:

a

b

c

19

14

33

a

d

g

19

31

24

sort

a

a

b

c

d

d

d

e

g

m

p

r

g

a

d

c

b

e

r

d

m

p

d

a

14

19

14

33

7

21

31

16

24

3

2

16

24

19

31

33

14

16

16

21

3

2

7

14

merge

sort

b

c

e

14

33

16

merge

sort

a

d

d

14

7

21

d

m

r

21

3

16

sort

merge

a

d

p

14

7

2

Algorithm

Let M = size of buffer (in blocks)

1. Sort runs of size M blocks each (except for last) and store. Use internal sort on each run.

2. Merge M-1 runs at a time into 1 and store. Merge for all runs.

3. if step 2 results in more than 1 run, goto step 2.

Run

m-1

Output

Run 1

Run 2

........

Run 3

Cost: 2 bR * (logM-1(bR / M) + 1)

Intuition:

Step 1: create runs

 every block read and written once

 cost 2 bR I/Os

Step 2: Merge

 every merge iteration requires reading and writing entire file (2 bR I/Os)

Total:

logM-1(bR / M)

Iteration #

---------------

1

2

3

.....

Runs Left to Merge

----------------------------

Query: depositor customer

Merge-sort Join

Sorting depositor:

bdepositor = 100

Sort depositor = 2 * 100 * (log2(100 / 3) + 1)

= 1400

Similarly, for customer we get 7200 I/Os.

Total: 100 + 400 + 1400 + 7200 = 9100 I/O’s!

Still beats BNLJ (40K), INLJ (25K)

Why not use SMJ always?

Ans: 1) Sometimes inner relation can fit in memory

2) Sometimes index is small

3) SMJ only work for natural joins, “equijoins”

• hybrid merge-join: If one relation is sorted, and the other has a secondary B+-tree index on the join attribute

• Merge the sorted relation with the leaf entries of the B+-tree .

• Sort the result on the addresses of the unsorted relation’s tuples

• Scan the unsorted relation in physical address order and merge with previous result, to replace addresses by the actual tuples

• Sequential scan more efficient than random lookup