
B+-tree and Hashing



Presentation Transcript


  1. B+-tree and Hashing

  2. Model of Computation
  [Figure: CPU, memory, and disk(s)]
  • Data stored on disk(s)
  • Minimum transfer unit: a page = b bytes or B records
  • N records -> N/B = n pages
  • I/O complexity: measured in number of pages

  3. I/O complexity
  • An ideal index has space O(N/B), update overhead O(log_B(N/B)), and search complexity O(log_B(N/B) + a/B), where a is the number of records in the answer
  • But sometimes CPU performance is also important… minimize cache faults -> don't waste CPU cycles
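
As a quick sanity check of these bounds, here is a small back-of-the-envelope calculation; the numbers N = 100,000,000 and B = 100 are illustrative, not from the slides.

```python
import math

# Illustrative numbers (not from the slides).
N = 100_000_000   # records
B = 100           # records per page

n = N // B               # n = N/B = 1,000,000 pages of data
levels = math.log(n, B)  # log_B(N/B): height of an ideal B-ary index
print(n, round(levels))  # 1000000 3  -> a point search costs about 3 page I/Os
```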

  4. B+-tree
  • Records can be ordered over an attribute: SSN, Name, etc.
  • Queries: exact-match and range queries over the indexed attribute: "find the person with SSN=087-34-7892" or "find all students with gpa between 3.00 and 3.5"
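
As a minimal illustration of the two query types, the sketch below runs them against a plain sorted key list standing in for the tree's leaf level; bisect does the logarithmic search that the index nodes would otherwise provide. The function names are made up for this sketch.

```python
import bisect

keys = [2.7, 3.0, 3.1, 3.3, 3.5, 3.9]   # indexed attribute values, e.g. gpa

def exact_match(keys, k):
    """Exact-match query: is key k present?"""
    i = bisect.bisect_left(keys, k)
    return i < len(keys) and keys[i] == k

def range_query(keys, lo, hi):
    """Range query: all keys with lo <= key <= hi."""
    return keys[bisect.bisect_left(keys, lo):bisect.bisect_right(keys, hi)]

print(exact_match(keys, 3.1))        # True
print(range_query(keys, 3.00, 3.5))  # [3.0, 3.1, 3.3, 3.5]
```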

  5. B+-tree: properties
  • Insert/delete at log_F(N/B) cost; keep the tree height-balanced (F = fanout)
  • Minimum 50% occupancy (except for the root). Each node contains d <= m <= 2d entries. The parameter d is called the order of the tree (n = 2d).
  • Two types of nodes: index nodes and data nodes; each node is 1 page (disk-based method)

  6. Example
  [Figure: B+-tree with root key 100; index nodes 30 (left) and 120 150 180 (right); leaf nodes 3 5 11 | 30 35 | 100 101 110 | 120 130 | 150 156 179 | 180 200]

  7. Index node
  [Figure: index node with keys 57 81 95 and four child pointers: to keys k < 57, to 57 <= k < 81, to 81 <= k < 95, and to 95 <= k]

  8. Data node
  [Figure: leaf (data) node with keys 57 81 95; each entry points to the record with that key, and a sibling pointer leads to the next leaf in sequence; the node is reached from a non-leaf node above]
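
A minimal sketch of the two node types from slides 7 and 8, written as Python dataclasses; the field names are my own, not from the slides.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class IndexNode:
    keys: List[int] = field(default_factory=list)          # e.g. [57, 81, 95]
    children: List[object] = field(default_factory=list)   # len(children) == len(keys) + 1

@dataclass
class DataNode:
    keys: List[int] = field(default_factory=list)          # e.g. [57, 81, 95]
    records: List[object] = field(default_factory=list)    # pointers to the actual records
    next: Optional["DataNode"] = None  # sibling pointer to the next leaf in key sequence
```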

  9. Insertion
  • Find correct leaf L.
  • Put data entry onto L.
  • If L has enough space, done!
  • Else, must split L (into L and a new node L2); a sketch of the leaf split follows this list
    • Redistribute entries evenly, copy up middle key.
    • Insert index entry pointing to L2 into parent of L.
  • This can happen recursively
    • To split an index node, redistribute entries evenly, but push up middle key. (Contrast with leaf splits.)
  • Splits "grow" the tree; a root split increases its height.
    • Tree growth: gets wider or one level taller at top.
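
A sketch of the leaf-split step only, reusing the DataNode type sketched above and assuming each node holds at most 2d entries; index-node splits differ only in that the middle key is pushed up rather than copied.

```python
def split_leaf(leaf: DataNode, d: int):
    """Split an overfull leaf into (leaf, new_leaf); return the key to copy up.

    The upper half of the entries moves to a new sibling; the smallest key of
    the new sibling is *copied* up into the parent index node.
    """
    mid = d  # keep d entries in the old leaf, move the rest
    new_leaf = DataNode(keys=leaf.keys[mid:], records=leaf.records[mid:], next=leaf.next)
    leaf.keys, leaf.records = leaf.keys[:mid], leaf.records[:mid]
    leaf.next = new_leaf                  # keep the leaf chain intact
    return new_leaf, new_leaf.keys[0]     # copy-up key for the parent
```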

  10. Deletion
  • Start at root, find leaf L where entry belongs.
  • Remove the entry.
  • If L is at least half-full, done!
  • If L has only d-1 entries:
    • Try to re-distribute, borrowing from a sibling (an adjacent node with the same parent as L).
    • If re-distribution fails, merge L and the sibling (see the sketch after this list).
  • If a merge occurred, must delete the entry (pointing to L or the sibling) from the parent of L.
  • A merge could propagate to the root, decreasing the height.
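
A sketch of the underflow handling only, again on the DataNode type above; it assumes the sibling is the right neighbour under the same parent and leaves the parent update to the caller.

```python
def handle_underflow(leaf: DataNode, sibling: DataNode, d: int):
    """Leaf has d-1 entries: redistribute from the sibling if it can spare one,
    otherwise merge. Returns ('redistribute', new_separator) or ('merge', leaf);
    in the merge case the caller must delete the separating entry from the parent."""
    if len(sibling.keys) > d:                     # sibling can spare an entry
        leaf.keys.append(sibling.keys.pop(0))
        leaf.records.append(sibling.records.pop(0))
        return "redistribute", sibling.keys[0]    # parent's separator key changes
    # Merge: absorb the sibling's entries and drop the sibling from the leaf chain.
    leaf.keys += sibling.keys
    leaf.records += sibling.records
    leaf.next = sibling.next
    return "merge", leaf
```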

  11. Characteristics
  • Optimal method for 1-d range queries: space linear, update and query logarithmic in the number of I/Os.
  • Space utilization: 67% for random input
  • Original B-tree: index nodes also store pointers to records

  12. Other issues
  • Prefix compression: in index nodes, store only the prefix that differentiates consecutive sub-trees. Fanout is increased. (A small sketch follows this list.)
  • Cache-sensitive B+-tree: place keys in a way that reduces cache faults during the binary search within each node.
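
A small sketch of the prefix idea: the separator stored in an index node only needs to be long enough to distinguish the largest key of the left subtree from the smallest key of the right subtree. The function name and example strings are my own.

```python
def separator_prefix(left_max: str, right_min: str) -> str:
    """Shortest prefix p of right_min with left_max < p <= right_min."""
    for i in range(1, len(right_min) + 1):
        p = right_min[:i]
        if left_max < p:
            return p
    return right_min

print(separator_prefix("Smith", "Sotirakopoulos"))  # 'So' is enough as the separator
```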

  13. Hashing
  • Hash-based indices are best for exact-match queries. Faster than a B+-tree!
  • Typically 1-2 I/Os per query, where a B+-tree requires 4-5 I/Os
  • But cannot answer range queries…

  14. Idea
  • Use a function to direct a record to a page
  • h(k) mod M = bucket to which the data entry with key k belongs (M = # of buckets)
  [Figure: key k is fed into h; h(key) mod M selects one of the primary bucket pages 0, 1, ..., M-1]

  15. Design decisions
  • Hash function: division or multiplication (both sketched after this list)
    • Division: h(x) = (a*x + b) mod M
    • Multiplication: h(x) = floor( M * fractional-part-of(x * φ) ), where φ is the golden-ratio fraction (sqrt(5) - 1)/2 ≈ 0.618...
  • Size of hash table M
  • Overflow handling: open addressing or chaining (a problem in dynamic databases)
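
Both functions, sketched in Python; the constants a, b and the table size M are illustrative.

```python
import math

M = 1024                      # hash table size (illustrative)
PHI = (math.sqrt(5) - 1) / 2  # golden-ratio fraction, ~0.618

def h_division(x: int, a: int = 31, b: int = 17) -> int:
    """Division method: h(x) = (a*x + b) mod M."""
    return (a * x + b) % M

def h_multiplication(x: int) -> int:
    """Multiplication method: h(x) = floor(M * frac(x * phi))."""
    return int(M * ((x * PHI) % 1.0))

print(h_division(12345), h_multiplication(12345))
```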

  16. Dynamic hashing schemes
  • Extendible hashing: uses a directory that grows or shrinks depending on the data distribution. No overflow buckets.
  • Linear hashing: no directory. Splits buckets in linear order, uses overflow buckets.

  17. Extendible Hashing
  • Bucket (primary page) becomes full. Why not re-organize the file by doubling the # of buckets?
    • Reading and writing all pages is expensive!
  • Idea: use a directory of pointers to buckets; double the # of buckets by doubling the directory, splitting just the bucket that overflowed!
    • The directory is much smaller than the file, so doubling it is much cheaper. Only one page of data entries is split.
    • The trick lies in how the hash function is adjusted! (A sketch follows this list.)
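
A compact sketch of the directory mechanics described above, addressing buckets by the last global_depth bits of the hash value; the bucket capacity of 4 and the class layout are assumptions of this sketch, not the slides' exact algorithm.

```python
class Bucket:
    def __init__(self, depth, capacity=4):
        self.depth, self.capacity, self.keys = depth, capacity, []

class ExtendibleHash:
    def __init__(self, capacity=4):
        self.global_depth = 1
        self.dir = [Bucket(1, capacity), Bucket(1, capacity)]

    def _bucket(self, k):
        # Directory slot = last global_depth bits of the hash value.
        return self.dir[hash(k) & ((1 << self.global_depth) - 1)]

    def insert(self, k):
        b = self._bucket(k)
        if len(b.keys) < b.capacity:
            b.keys.append(k)
            return
        if b.depth == self.global_depth:   # overflowed bucket is at max depth:
            self.dir += self.dir           # double the directory (cheap: pointers only)
            self.global_depth += 1
        b.depth += 1
        image = Bucket(b.depth, b.capacity)            # the 'split image' of b
        for i in range(len(self.dir)):                 # redirect half of b's slots
            if self.dir[i] is b and (i >> (b.depth - 1)) & 1:
                self.dir[i] = image
        keys, b.keys = b.keys + [k], []
        for key in keys:                               # redistribute old keys plus k
            self.insert(key)
```

With a capacity of 4, inserting 32, 16, 4, 12 and then 20 leaves 32* and 16* in one bucket and 4*, 12*, 20* in its split image, matching the outcome shown on the next slide.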

  18. Insert h(k) = 20 (binary 10100)
  [Figure: before the insert, GLOBAL DEPTH = 2 and the four directory entries 00, 01, 10, 11 point to Bucket A (4*, 12*, 32*, 16*), Bucket B (1*, 5*, 21*, 13*), Bucket C (10*), and Bucket D (15*, 7*, 19*). Bucket A overflows, so the directory doubles to GLOBAL DEPTH = 3 and Bucket A splits: A keeps 32*, 16* while its 'split image' Bucket A2 receives 4*, 12*, 20*; both now have LOCAL DEPTH = 3, and the eight directory entries 000 to 111 address buckets by the last three bits of the hash value.]
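
The bit arithmetic behind this figure can be checked directly, assuming h(k) = k as the slide's values suggest:

```python
# With GLOBAL DEPTH = 3, a key goes to the directory slot given by the
# last 3 bits of its (binary) hash value.
for k in (32, 16, 4, 12, 20):
    print(k, format(k, '05b'), format(k & 0b111, '03b'))
# 32 and 16 end in 000 -> Bucket A;  4, 12 and 20 end in 100 -> Bucket A2
```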

  19. Linear Hashing
  • Another dynamic hashing scheme, an alternative to Extendible Hashing.
  • Motivation: Extendible Hashing uses a directory that grows by doubling… Can we do better? (smoother growth)
  • LH: split buckets from left to right, regardless of which one overflowed (simple, but it works!!)

  20. Linear Hashing (Contd.)
  • Directory avoided in LH by using overflow pages (chaining approach).
  • Splitting proceeds in 'rounds'. A round ends when all N_R initial buckets (for round R) are split. Buckets 0 to Next-1 have been split; Next to N_R are yet to be split.
  • The current round number is Level.
  • Search: to find the bucket for data entry r, compute h_Level(r):
    • If h_Level(r) is in the range Next to N_R, r belongs there.
    • Else, r could belong to bucket h_Level(r) or to bucket h_Level(r) + N_R; must apply h_{Level+1}(r) to find out.
  (A sketch of the addressing and the split step follows this list.)
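
A sketch of the address computation and of one round-robin split; the names Level, Next and N follow the slide, while the list-based bucket representation (with overflow modelled simply as a longer list) is my own.

```python
class LinearHash:
    def __init__(self, N=4, capacity=3):
        self.N0, self.level, self.next = N, 0, 0   # N_R = N0 * 2**level
        self.capacity = capacity
        self.buckets = [[] for _ in range(N)]

    def _h(self, level, k):
        return k % (self.N0 * (2 ** level))        # h_level(k)

    def _addr(self, k):
        b = self._h(self.level, k)
        if b < self.next:                          # bucket b was already split this round
            b = self._h(self.level + 1, k)
        return b

    def insert(self, k):
        self.buckets[self._addr(k)].append(k)      # overflow pages = extra list entries
        if len(self.buckets[self._addr(k)]) > self.capacity:
            self._split()                          # split bucket Next, whichever bucket overflowed

    def _split(self):
        keys, self.buckets[self.next] = self.buckets[self.next], []
        self.buckets.append([])                    # new bucket Next + N_R
        for k in keys:
            self.buckets[self._h(self.level + 1, k)].append(k)
        self.next += 1
        if self.next == self.N0 * (2 ** self.level):   # round over: every bucket was split
            self.level, self.next = self.level + 1, 0
```

With N = 4 and a capacity of 3 records per bucket, loading 4, 8, 5, 9, 13, 6, 7, 11 and then inserting 17 reproduces the split of bucket 0 shown on the next slides.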

  21. Linear Hashing: Example
  • Initially: h(x) = x mod N (N = 4 here); assume 3 records per bucket
  • Insert 17: 17 mod 4 = 1
  [Figure: buckets 0..3 hold {4, 8}, {5, 9, 13}, {6}, {7, 11}]

  22. Linear Hashing: Example
  • Initially: h(x) = x mod N (N = 4 here); assume 3 records per bucket
  • Insert 17: 17 mod 4 = 1, but bucket 1 is full, so 17 goes to an overflow page for bucket 1
  • Split bucket 0, anyway!!
  [Figure: buckets 0..3 hold {4, 8}, {5, 9, 13} plus 17 in an overflow page, {6}, {7, 11}]

  23. Linear Hashing: Example
  • To split bucket 0, use another function h1(x): h0(x) = x mod N, h1(x) = x mod (2*N)
  [Figure: split pointer at bucket 0; buckets 0..3 hold {4, 8}, {5, 9, 13, 17}, {6}, {7, 11}]

  24. Linear Hashing: Example
  • To split bucket 0, use another function h1(x): h0(x) = x mod N, h1(x) = x mod (2*N)
  [Figure: bucket 0 is split using h1 into buckets 0 and 4; buckets 0..4 now hold {8}, {5, 9, 13, 17}, {6}, {7, 11}, {4}, and the split pointer advances to bucket 1]

  25. Linear Hashing: Example
  • h0(x) = x mod N, h1(x) = x mod (2*N)
  [Figure: after the split, buckets 0..4 hold {8}, {5, 9, 13, 17}, {6}, {7, 11}, {4}]

  26. Linear Hashing: Example
  • h0(x) = x mod N, h1(x) = x mod (2*N)
  • Insert 15 and 3
  [Figure: buckets 0..4 hold {8}, {5, 9, 13, 17}, {6}, {7, 11}, {4}]

  27. Linear Hashing: Example
  • h0(x) = x mod N, h1(x) = x mod (2*N)
  [Figure: 15 and 3 both hash to bucket 3, which overflows, so the bucket at the split pointer (bucket 1) is split next; buckets 0..5 now hold {8}, {9, 17}, {6}, {7, 11, 15, 3}, {4}, {5, 13}, and the split pointer advances to bucket 2]

  28. Linear Hashing: Search
  • h0(x) = x mod N (for the un-split buckets)
  • h1(x) = x mod (2*N) (for the split ones)
  [Figure: the six-bucket table from the previous slide]

  29. Linear Hashing: Search
  Algorithm Search(k):
    b = h0(k)
    if b < split-pointer then
      b = h1(k)
    read bucket b and search there
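
The same algorithm as a Python sketch, written against the LinearHash class sketched under slide 20 (reading a bucket is simulated by a list membership test):

```python
def lh_search(table: LinearHash, k) -> bool:
    b = table._h(table.level, k)           # b = h0(k) for the current round
    if b < table.next:                     # bucket b has already been split
        b = table._h(table.level + 1, k)   # so use h1(k)
    return k in table.buckets[b]           # read bucket b and search there
```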
