evaluation of relational operations
Download
Skip this Video
Download Presentation
Evaluation of Relational Operations

Loading in 2 Seconds...

play fullscreen
1 / 19

Evaluation of Relational Operations - PowerPoint PPT Presentation


  • 53 Views
  • Uploaded on

Evaluation of Relational Operations. Notation. M = Memory allocated for pages T R = # of tuples in R. b R = # of tuples of R per page. B R = # of pages (blocks) occupied by R. b = # of bytes per page. I R.A = Image of attribute A in R = # of distinct values of A in R.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' Evaluation of Relational Operations' - teranika-fullerton


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
notation
Notation
  • M = Memory allocated for pages
  • TR= # of tuples in R.
  • bR = # of tuples of R per page.
  • BR= # of pages (blocks) occupied by R.
  • b = # of bytes per page.
  • IR.A = Image of attribute A in R = # of distinct values of A in R.
  • C = Cost of the relational operation.
  • Cost metric: # of I/Os. We will ignore output costs.
selection
Selection

SELECT *

FROMSailors S

WHERE S.rating = 10

  • Number of tuples with rating =10 is TS / IS.rating; v.g., ifTS.rating =100, and TS = 10,000 there are, on average, 100 sailors with rating = 10.
  • If there is a non-clustered index on rating, in the worst case each satisfying tuple can be in a different page => Cost = TS / IS.rating(note that here the unit is # of I/O)
  • If there is a clustered index on rating, all satisfying tuples are packed together, thus Cost = TS / (bS * IS.rating) = BS / IS.rating
  • Whenever Cost >BS => Scan the relation and check the condition on-the-fly.
  • If condition is rating < 10, if no other information exists, assume that 1/2 of the tuples satisfy the inequality => Cost = 0.5 TS or 0.5 BS with unclustered or clustered index, respectively.
selection1
Selection

SELECT *

FROMSailors S

WHERE S.rating = 10 AND S. age = 20

  • Number of tuples with rating =10 and age = 20is TS /(IS.rating * IS.age);
  • The images of rating and age must be considered. The file will be accessed through the index with the higher image. If IS.rating =100 and IS.age =10 (10 different ages for sailors), on the average there will be 1000 sailors with age=20 and 100 sailors with rating = 10 => access the file through the index over rating, and with the pages in main memory, check if age = 20. This is called the Most selective access path approach.
  • A compound index may also be used.
selection2
Selection

SELECT *

FROMSailors S

WHERE S.rating = 10 OR S. age = 20

  • Number of tuples with rating =10 or age = 20is (roughly) TS /IS.rating + TS IS.age;
  • We can only use indices if both attributes are indexed. If IS.rating =100 and IS.age =10, in the worst case, the 100 sailors with rating = 10 can be all different than the 1000 sailors with age = 20 => the strategy of the most selective path does not work.
  • If there is an index over rating and there is NO index over age, scan the relation.
intersection union of rids
Intersection (Union) of Rids
  • Another approach for disjunctive or conjunctive queries
    • Get sets of rids of data records using each matching index.
    • Then intersect these sets of rids
    • Retrieve the records and apply any remaining terms.
    • Consider day<8/9/94 AND bid=5 AND sid=3. If we have a B+ tree index on day and an index on sid, we can retrieve rids of records satisfying day<8/9/94 using the first, rids of recs satisfying sid=3 using the second, intersect, retrieve records and check bid=5.
    • We can proceed analogously with disjunctions, but instead of intersection, we apply union of rids.
cross product
Cross Product

SELECT *

FROM R , S

  • Assume BR < BS
  • Two cases: M > BR and M < BR
  • If M > BR , load all R into memory, and one page of S at a time.

Load R in main memory (cost= BR);

foreach page ps in S do

Load ps into main memory

for each tuple r in R do

for each tuple s in ps do

add <r, s> to result

C = BR +BS

cross product cont
Cross Product (cont.)
  • If M < BR
  • Partition R in segments of M-1 pages => BR /(M-1) segments
  • Load one segment of R and one page of S at a time.

for each segment Ri of R

Load segment Ri in main memory;

foreach page ps in S do

Load ps in main memory

for each tuple s in ps do

for each tuple r in Ri do

add <r, s> to result

C = BR +BS *BR /(M-1)

equality joins with one join column
Equality Joins With One Join Column

SELECT *

FROM R,S

WHERE R.B=S.B

  • In algebra: R S. Common! Must be carefully optimized. R S is large; so, R S followed by a selection is inefficient (except for small relations).
  • First alternative: page-oriented nested loops as seen before. But previously to the output <r,s>, check the join condition.
cost of joins cont
Cost of Joins (cont.)
  • Number of tuples in the output of a join R S
  • For each tuple r in R, the estimate # of joining tuples isTS / IS.B . Thus, for all the tuples, TRxTS / IS.A. The same could be done in terms of S, thus,
  • # of tuples in R S =TRxTS / max (IR.B , IS.B)
  • In general, if the join is a conjunction of equalities:

# of tuples in R S =TRxTS / (max (IR.B , IS.B) x max (IR.C , IS.C) x ….)

index nested loops join
Index Nested Loops Join
  • If there is an index on the join column of one relation (say S), can make it the inner and exploit the index.

read R into main memory

foreach tuple r in R do

use index on S.B for accessing pages of S with tuples s.t. s.B == r.B

for each tuple that matches, add <r, s> to result

jones 25

smith 2

….

25

25 red

2 blue

….

2

25 pink

4 black

….

R

Index on S.B

S

index nested loops join cont
Index Nested Loops Join (cont.)
  • If the index is non-clustered
    • Cost. for each tuple r in R, the # of matching tuples in S is

TS / IS.B ; then, Cost = BR + TS *TS / IS.B

  • For a clustered index:
    • Cost. for each tuple r in R, the # of matching tuples in S is

TS / IS.B ; but matching tuples are packed together; then,

Cost = BR + TS *BS / IS.B

sort merge join r s
Sort-Merge Join (R S)

i=j

  • Sort R and S on the join column, then scan them to do a ``merge’’ (on join col.), and output result tuples.
    • Advance scan of R until current R-tuple >= current S tuple, then advance scan of S until current S-tuple >= current R tuple; do this until current R tuple = current S tuple.
    • At this point, all R tuples with same value in Ri (current R group) and all S tuples with same value in Sj (current S group) match; output <r, s> for all pairs of such tuples.
    • Then resume scanning R and S.
  • R is scanned once; each S group is scanned once per matching R tuple. (Multiple scans of an S group are likely to find needed pages in buffer.)
example of sort merge join
Example of Sort-Merge Join
  • Cursors now are in sid=28 and will advance to sid=31. Before that, <28, yuppy, 9, 35.0, 103, 12/4/96…> and <28, yuppy, 9, 35.0, 103, 11/3/96…> are added to the output set.
  • Cost: M log M + N log N + (M+N)
hash join

Original

Relation

Partitions

OUTPUT

1

1

2

INPUT

2

hash

function

h

. . .

B-1

B-1

B main memory buffers

Disk

Disk

Hash-Join
  • Partition both relations using hash fn h: R tuples in partition i will only match S tuples in partition i.
  • Then, check in each bucket for the matching tuples. Note that only tuples of R and S in the same bucket must be checked.
cost of hash join
Cost of Hash-Join
  • In partitioning phase, read+write both relns; 2(BR+BS). In matching phase, read both relns; BR+BS I/Os.
  • Sort-Merge Join vs. Hash Join:
    • Given a minimum amount of memory both have a cost of 3(BR+BS) I/Os. Hash Join superior on this count if relation sizes differ greatly. Also, Hash Join shown to be highly parallelizable.
    • Sort-Merge less sensitive to data skew; result is sorted.
general join conditions
General Join Conditions
  • Equalities over several attributes (e.g., R.B=S.B ANDR.C=S.C):
    • For Index Nested Loops, build index on <B, C> (if S is inner); or use existing indexes on B or C.
    • For Sort-Merge and Hash Join, sort/partition on combination of the two join columns.
  • Inequality conditions (e.g., R.B < S.B):
    • For Index NL, need (clustered!) B+ tree index.
      • Range probes on inner; # matches likely to be much higher than for equality joins.
    • Hash Join, Sort Merge Join not applicable.
    • Block NL quite likely to be the best join method here.
set operations
Set Operations
  • Intersection special case of join.
  • Union (Distinct) and Except similar; we’ll do union.
  • Sorting based approach to union:
    • Sort both relations (on combination of all attributes).
    • Scan sorted relations and merge them.
  • Hash based approach to union:
    • Partition R and S using hash function h.
    • For each S-partition, build in-memory hash table, scan corr. R-partition and add tuples to table while discarding duplicates.
aggregate operations avg min etc
Aggregate Operations (AVG, MIN, etc.)
  • Without grouping:
    • In general, requires scanning the relation.
    • Given index whose search key includes all attributes in the SELECT or WHERE clauses, can do index-only scan.
  • With grouping:
    • Sort on group-by attributes, then scan relation and compute aggregate for each group. (Can improve upon this by combining sorting and aggregate computation.)
    • Similar approach based on hashing on group-by attributes.
    • Given tree index whose search key includes all attributes in SELECT, WHERE and GROUP BY clauses, can do index-only scan; if group-by attributes form prefix of search key, can retrieve data entries/tuples in group-by order.
ad