1 / 12

Query Execution 15.5 Two-pass Algorithms based on Hashing

Query Execution 15.5 Two-pass Algorithms based on Hashing. By Swathi Vegesna. At a glimpse. Introduction Partitioning Relations by Hashing Algorithm for Duplicate Elimination Grouping and Aggregation Union, Intersection, and Difference Hash-Join Algorithm Sort based Vs Hash based

Download Presentation

Query Execution 15.5 Two-pass Algorithms based on Hashing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Query Execution15.5 Two-pass Algorithms based on Hashing By Swathi Vegesna

  2. At a glimpse • Introduction • Partitioning Relations byHashing • Algorithm for Duplicate Elimination • Grouping and Aggregation • Union, Intersection, and Difference • Hash-Join Algorithm • Sort based Vs Hash based • Summary

  3. Introduction Hashing is done if the data is too big to store in main memory buffers. • Hash all the tuples of the argument(s) using an appropriate hash key. • For all the common operations, there is a way to select the hash key so all the tuples that need to be considered together when we perform the operation have the same hash value. • This reduces the size of the operand(s) by a factor equal to the number of buckets.

  4. Partitioning Relations byHashing Algorithm: initialize M-1 buckets using M-1empty buffers; FOR each block b of relation RDO BEGIN read block b into the Mth buffer; FOR each tuple t in b DO BEGIN IF the buffer for bucket h(t) has no room for t THEN BEGIN copy the buffer t o disk; initialize a new empty block in that buffer; END; copy t to the buffer for bucket h(t); END ; END ; FOR each bucket DO IF the buffer for this bucket is not empty THEN write the buffer to disk;

  5. Duplicate Elimination • For the operation δ(R) hash R to M-1 Buckets. (Note that two copies of the same tuple t will hash to the same bucket) • Do duplicate elimination on each bucket Riindependently, using one-pass algorithm • The result is the union of δ(Ri), where Ri is the portion of R that hashes to the ith bucket

  6. Requirements • Number of disk I/O's: 3*B(R) • B(R) < M(M-1), only then the two-pass, hash-based algorithm will work • In order for this to work, we need: • hash function h evenly distributes the tuples among the buckets • each bucket Rifits in main memory (to allow the one-pass algorithm) • i.e., B(R) ≤ M2

  7. Grouping and Aggregation • Hash all the tuples of relation R to M-1 buckets, using a hash function that depends only on the grouping attributes (Note: all tuples in the same group end up in the same bucket) • Use the one-pass algorithm to process each bucket independently • Uses 3*B(R) disk I/O's, requires B(R) ≤ M2

  8. Union, Intersection, and Difference • For binary operation we use the same hash function to hash tuples of both arguments. • R U S we hash both R and S to M-1 • R ∩ S we hash both R and S to 2(M-1) • R-S we hash both R and S to 2(M-1) • Requires 3(B(R)+B(S)) disk I/O’s. • Two pass hash based algorithm requires min(B(R)+B(S))≤ M2

  9. Hash-Join Algorithm • Use same hash function for both relations; hash function should depend only on the join attributes • Hash R to M-1 buckets R1, R2, …, RM-1 • Hash S to M-1 buckets S1, S2, …, SM-1 • Do one-pass join of Riand Si, for all i • 3*(B(R) + B(S)) disk I/O's; min(B(R),B(S)) ≤ M2

  10. Sort based Vs Hash based • For binary operations, hash-based only limits size to min of arguments, not sum • Sort-based can produce output in sorted order, which can be helpful • Hash-based depends on buckets being of equal size • Sort-based algorithms can experience reduced rotational latency or seek time

  11. Summary • Partitioning Relations byHashing • Algorithm for Duplicate Elimination • Grouping and Aggregation • Union, Intersection, and Difference • Hash-Join Algorithm • Sort based Vs Hash based

  12. Thank you

More Related