
Yet More on Indexes

Hash Tables. Source: our textbook, slides by Hector Garcia-Molina.


Presentation Transcript


  1. Yet More on Indexes: Hash Tables. Source: our textbook, slides by Hector Garcia-Molina

  2. Main Memory Hash Tables
     • A hash function h maps search keys to integers in the range 0 to B-1
     • B is the number of buckets
     • There is a B-element array; each entry holds a pointer to a linked list
     • A record with key k is put in the linked list that starts at entry h(k) of the array

  3. Example of Hash Table
     B = 5, h(k) = k mod 5
     Bucket 0: 15, 10
     Bucket 1: (empty)
     Bucket 2: 22
     Bucket 3: (empty)
     Bucket 4: 104, 29, 34
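
The example can be reproduced with a minimal sketch of a chained hash table. The class name `ChainedHashTable` and its methods are hypothetical, and Python lists stand in for the linked lists; this is an illustration under those assumptions, not the textbook's code.

```python
class ChainedHashTable:
    """Main-memory hash table: a B-entry array whose entries are chains of keys."""

    def __init__(self, num_buckets):
        self.B = num_buckets
        # One chain per array entry (a Python list stands in for a linked list).
        self.buckets = [[] for _ in range(num_buckets)]

    def h(self, key):
        # Hash function from the example: h(k) = k mod B.
        return key % self.B

    def insert(self, key):
        # The record goes into the chain that starts at entry h(k).
        self.buckets[self.h(key)].append(key)

    def lookup(self, key):
        # Only the chain at entry h(k) has to be searched.
        return key in self.buckets[self.h(key)]


table = ChainedHashTable(5)
for k in (15, 10, 22, 104, 29, 34):
    table.insert(k)

for index, chain in enumerate(table.buckets):
    print(index, chain)
```

The order of keys within a chain depends only on insertion order here; the slide does not say how its chains are ordered.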

  4. Changes for Secondary Storage
     • The bucket array contains blocks, not pointers to linked lists
     • Records that hash to a certain bucket are put in the corresponding block
     • If a bucket overflows, start a chain of overflow blocks

  5. Insertion into Static Hash Table
     • To insert a record with key K:
       • compute h(K)
       • insert the record into one of the blocks in the chain of blocks for bucket number h(K), adding a new block to the chain if necessary

  6. Example: 2 records/bucket
     Insert: h(a) = 1, h(b) = 2, h(c) = 1, h(d) = 0, h(e) = 1
     Bucket 0: d
     Bucket 1: a, c, then e in an overflow block
     Bucket 2: b
     Bucket 3: (empty)
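
As a rough sketch of the insertion procedure, the following models each bucket as a chain of fixed-capacity blocks. `StaticHashFile` and `BLOCK_CAPACITY` are hypothetical names, and in-memory lists stand in for disk blocks; the hash values are the ones given in the example.

```python
BLOCK_CAPACITY = 2  # records per block, as in the example


class StaticHashFile:
    """Static hash table whose buckets are chains of fixed-size blocks."""

    def __init__(self, num_buckets, hash_fn):
        self.hash_fn = hash_fn
        # Every bucket starts with one empty primary block.
        self.buckets = [[[]] for _ in range(num_buckets)]

    def insert(self, record):
        chain = self.buckets[self.hash_fn(record)]
        # Use the first block in the chain that still has room.
        for block in chain:
            if len(block) < BLOCK_CAPACITY:
                block.append(record)
                return
        # Every block is full: add a new overflow block to the chain.
        chain.append([record])


# Hash values taken from the example slide.
hash_values = {"a": 1, "b": 2, "c": 1, "d": 0, "e": 1}
hash_file = StaticHashFile(4, hash_values.__getitem__)
for record in "abcde":
    hash_file.insert(record)

for number, chain in enumerate(hash_file.buckets):
    print(number, chain)
```

Bucket 1 ends up with a full primary block and one overflow block holding e, which is the situation the example illustrates.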

  7. Deletion from a Static Hash Table
     • To delete records with key K:
       • Go to the bucket numbered h(K)
       • Search for records with key K, deleting any that are found
       • Possibly condense the chain of overflow blocks for that bucket
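
A minimal sketch of deletion with condensing, assuming a bucket is represented as a list of blocks and each record is identified with its key. The function name `delete_from_chain` is hypothetical, and repacking the whole chain is only one possible condensing policy (a real system might move a single record instead).

```python
def delete_from_chain(chain, key, capacity):
    """Delete every record with the given key from a bucket's chain, then condense."""
    # Keep the records that do not match the key, in chain order.
    survivors = [record for block in chain for record in block if record != key]
    # Repack the survivors into as few blocks as possible.
    condensed = [survivors[i:i + capacity] for i in range(0, len(survivors), capacity)]
    # A bucket always keeps its primary block, even when it is empty.
    return condensed or [[]]


bucket = [["a", "c"], ["e"]]          # primary block plus one overflow block
bucket = delete_from_chain(bucket, "c", capacity=2)
print(bucket)                          # the overflow block is no longer needed
```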

  8. Example: deletion
     [Figure: buckets 0 to 3 holding records a to g]
     Delete: e, f
     After the deletion, maybe move “g” up

  9. Rule of thumb
     • Try to keep space utilization between 50% and 80%
     • Utilization = (# records used) / (total # records that fit)
     • If < 50%, wasting space
     • If > 80%, overflows become significant; how significant depends on how good the hash function is and on # records/bucket
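
A small worked example of the utilization formula. It assumes that "records that fit" counts only the primary blocks (number of buckets times records per bucket); the function and parameter names are hypothetical.

```python
def utilization(records_used, num_buckets, records_per_bucket):
    """Fraction of the available record slots that are occupied."""
    return records_used / (num_buckets * records_per_bucket)


# Five records stored in four buckets that hold two records each.
u = utilization(records_used=5, num_buckets=4, records_per_bucket=2)
print(f"{u:.1%}")
print("within rule of thumb" if 0.5 <= u <= 0.8 else "outside rule of thumb")
```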

  10. Efficiency of Static Hash Tables
      • If the hash table is large enough and the hash function distributes keys sufficiently evenly, then most buckets have no overflow blocks
      • In this case lookup typically takes one disk I/O, and insertion and deletion take two
      • Significantly better than sequential indexes and B-trees
      • (But: hash tables do not support efficient range queries as B-trees do)
      • What if there are long chains of overflow blocks?

  11. How do we cope with growth?
      • Overflows and reorganizations
      • Dynamic hashing
        • Extensible
        • Linear

  12. Extensible Hash Tables
      • Each bucket in the bucket array contains a pointer to a block, instead of a block itself
      • The bucket array can grow by doubling in size
      • Certain buckets can share a block if small enough
      • The hash function computes a sequence of k bits, but only the first i bits are used at any time to index into the bucket array
      • The value of i can increase (corresponds to the bucket array doubling in size)

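  13. (b) Use a directory: h(K)[i], the first i bits of h(K), selects a directory entry, which points to the bucket
      [Figure: directory entries pointing to buckets]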

  14. Inserting into Extensible Hash Table
      • To insert a record with key K:
        • compute h(K)
        • go to the bucket indexed by the first i bits of h(K)
        • follow the pointer to get to block B
        • if there is room in B, insert the record
        • else let j be the number of bits of the hash value used to determine membership in B

  15. Insertion cont'd
      • Case 1: j < i.
        • split block B in two
        • distribute the records in B to the 2 new blocks based on the value of their (j+1)-st bit
        • set the bit count in the header of each new block to j+1
        • adjust pointers in the bucket array so that entries that used to point to B now point to the correct block
        • if there is still no room in the appropriate block for the new record, repeat this process

  16. Insertion cont'd
      • Case 2: j = i.
        • increment i by 1
        • double the length of the bucket array
        • entries for w0 and w1 both point to the same block that the old entry w pointed to (the block is shared)
        • apply case 1 to split block B

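  17. Example: h(k) is 4 bits; 2 keys/bucket
      Before (i = 1):
        0 → block [0001], 1 bit used
        1 → block [1001, 1100], 1 bit used
      Insert 1010
      New directory (i = 2):
        00, 01 → block [0001], 1 bit used
        10 → block [1001, 1010], 2 bits used
        11 → block [1100], 2 bits used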

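  18. Example continued (i = 2)
      Insert: 0111, 0000
      After inserting 0111:
        00, 01 → block [0001, 0111], 1 bit used
        10 → block [1001, 1010], 2 bits used
        11 → block [1100], 2 bits used
      After inserting 0000 (the shared block splits):
        00 → block [0000, 0001], 2 bits used
        01 → block [0111], 2 bits used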

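  19. Example continued
      Insert: 1001
      Before (i = 2):
        00 → block [0000, 0001], 2 bits used
        01 → block [0111], 2 bits used
        10 → block [1001, 1010], 2 bits used
        11 → block [1100], 2 bits used
      After (i = 3):
        000, 001 → block [0000, 0001], 2 bits used
        010, 011 → block [0111], 2 bits used
        100 → block [1001, 1001], 3 bits used
        101 → block [1010], 3 bits used
        110, 111 → block [1100], 2 bits used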
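
The two insertion cases can be sketched as follows. This is an illustrative in-memory model, not the textbook's code: `Block`, `ExtensibleHashTable`, and their members are hypothetical names, hash values are passed in directly as bit strings, and a block that cannot be split any further simply raises an error instead of getting an overflow block.

```python
class Block:
    def __init__(self, depth):
        self.depth = depth    # j: number of hash bits that determine membership
        self.records = []


class ExtensibleHashTable:
    def __init__(self, capacity=2):
        self.capacity = capacity                # records per block
        self.i = 1                              # hash bits used by the directory
        self.directory = [Block(1), Block(1)]   # 2**i entries

    def insert(self, bits):
        while True:
            # Index the directory with the first i bits of the hash value.
            block = self.directory[int(bits[:self.i], 2)]
            if len(block.records) < self.capacity:
                block.records.append(bits)
                return
            if block.depth == len(bits):
                raise OverflowError("no hash bits left to split on")
            if block.depth == self.i:
                # Case 2: double the directory; entries w0 and w1 share w's block.
                self.directory = [b for b in self.directory for _ in (0, 1)]
                self.i += 1
            self._split(block)  # Case 1, then retry the insertion

    def _split(self, block):
        j = block.depth
        zero, one = Block(j + 1), Block(j + 1)
        # Distribute records on their (j+1)-st bit (string index j).
        for record in block.records:
            (one if record[j] == "1" else zero).records.append(record)
        # Repoint every directory entry that referred to the old block.
        for index, entry in enumerate(self.directory):
            if entry is block:
                entry_bits = format(index, f"0{self.i}b")
                self.directory[index] = one if entry_bits[j] == "1" else zero


table = ExtensibleHashTable(capacity=2)
for value in ("0001", "1001", "1100", "1010", "0111", "0000", "1001"):
    table.insert(value)

for index, block in enumerate(table.directory):
    print(format(index, f"0{table.i}b"), block.depth, sorted(block.records))
```

The first three inserts build the starting state of the example (i = 1); the remaining four are the inserts shown on the example slides.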

  20. Extensible hashing: deletion
      Two options:
      • No merging of blocks
      • Merge blocks and cut the directory if possible (reverse of the insert procedure)

  21. Summary: Extensible hashing
      + Can handle growing files
        • with less wasted space
        • with no full reorganizations
      - Indirection (not bad if the directory is in memory)
      - Directory doubles in size (now it fits, now it does not)

  22. Linear Hash Tables
      • The number of buckets increases more slowly than with extensible hashing
      • The number of buckets is chosen so that on average each block is x% full (say 80%); x is the threshold
      • Overflow blocks can occur, but the average number per bucket is << 1
      • Use the i low-order bits of the hash function's result to index into the bucket array

  23. Linear hashing
      • Another dynamic hashing scheme
      Two ideas:
      (a) Use the i low-order bits of the b-bit hash value (e.g. 01110101); i grows over time
      (b) The bucket array grows linearly

  24. Inserting into Linear Hash Table
      • To insert a record with key K, where the last i bits of h(K) are a1a2…ai:
        • Let m be the integer represented by a1a2…ai in binary
        • If m < n (the number of buckets), then bucket m exists, so put the record in that bucket
        • If m ≥ n, then bucket m does not (yet) exist, so put the record in the bucket whose index corresponds to 0a2…ai

  25. Inserting cont'd
      • If there is no room in the indicated bucket, create an overflow block
      • Compare # records / # buckets to the threshold
      • If it exceeds the threshold, add a new bucket and rearrange records
      • If the number of buckets exceeds 2^i, increment i by 1
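
The insertion steps can be sketched as below. This is a simplified in-memory model with hypothetical names (`LinearHashTable`, `_address`, `_grow`): hash values are passed in directly as integers, overflow blocks are implicit (a bucket's list may simply exceed `capacity`), and the threshold is interpreted as average fill, records / (buckets × capacity), with 0.8 as an assumed value.

```python
class LinearHashTable:
    def __init__(self, capacity=2, threshold=0.8):
        self.capacity = capacity      # records per block
        self.threshold = threshold    # assumed limit on records / (n * capacity)
        self.i = 1                    # number of low-order hash bits in use
        self.n = 2                    # current number of buckets
        self.count = 0                # records stored
        self.buckets = [[], []]       # each list is one bucket, overflow included

    def _address(self, h):
        m = h % (1 << self.i)             # integer given by the last i bits
        if m >= self.n:                   # bucket m does not exist yet:
            m -= 1 << (self.i - 1)        # clear the leading bit (0a2...ai)
        return m

    def insert(self, h):
        self.buckets[self._address(h)].append(h)
        self.count += 1
        if self.count / (self.n * self.capacity) > self.threshold:
            self._grow()

    def _grow(self):
        if self.n == 1 << self.i:
            self.i += 1                   # the new bucket needs one more bit
        new = self.n                      # index of the bucket being added
        old = new - (1 << (self.i - 1))   # bucket that shares its low-order bits
        self.buckets.append([])
        self.n += 1
        # Rearrange: move records whose last i bits now spell the new index.
        stay = []
        for h in self.buckets[old]:
            (self.buckets[new] if h % (1 << self.i) == new else stay).append(h)
        self.buckets[old] = stay


table = LinearHashTable(capacity=2, threshold=0.8)
for h in (0b0000, 0b1010, 0b0101, 0b1111, 0b0101):
    table.insert(h)

for index, bucket in enumerate(table.buckets):
    print(format(index, f"0{table.i}b"), [format(h, "04b") for h in bucket])
```

Incrementing i just before the bucket is added, rather than just after, is a design choice of this sketch so that `_address` always sees a consistent pair of i and n.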

  26. Example: b = 4 bits, i = 2, 2 keys/bucket
      Rule: if h(k)[i] ≤ m, then look at bucket h(k)[i]; else look at bucket h(k)[i] - 2^(i-1)
      Buckets in use: 00: [0000, 1010], 01: [0101, 1111]; m = 01 (max used block)
      Future growth buckets: 10, 11
      • Insert 0101
      • Can have overflow chains!

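  27. Example continued: b = 4 bits, i = 2, 2 keys/bucket
      • Insert 0101: bucket 01 is full, so the record goes into an overflow block
      [Figure: the file then grows into the future growth buckets; 1010 moves to bucket 10 and 1111 moves to bucket 11, leaving both 0101 records in bucket 01]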

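  28. Example continued: how to grow beyond this?
      i = 2, m = 11 (max used block)
      Buckets: 00: [0000], 01: [0101, 0101], 10: [1010], 11: [1111]
      Set i = 3: buckets 00 to 11 become 000 to 011, and buckets 100, 101, 110, 111 are added one at a time (m = 100, then m = 101, ...)
      When bucket 101 is added, the two 0101 records move to it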

  29. Summary: Linear hashing
      + Can handle growing files
        • with less wasted space
        • with no full reorganizations
      + No indirection like extensible hashing
      - Can still have overflow chains

  30. Comparing Index Approaches
      • Hashing is good for probes given a key, e.g., SELECT … FROM R WHERE R.A = 5

  31. Indexing vs Hashing
      • Sequential indexes and B-trees are good for range searches, e.g., SELECT … FROM R WHERE R.A > 5

  32. Index definition in SQL
      • CREATE INDEX name ON rel (attr)
      • CREATE UNIQUE INDEX name ON rel (attr), which defines a candidate key
      • DROP INDEX name
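
The three statements can be tried out with Python's built-in `sqlite3` module. The table `R`, its columns, and the index names `r_a_idx` and `r_b_idx` are made up for this illustration; other database systems may accept additional, system-specific options.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A INTEGER, B TEXT)")

# Plain index on R.A and a unique index on R.B.
conn.execute("CREATE INDEX r_a_idx ON R (A)")
conn.execute("CREATE UNIQUE INDEX r_b_idx ON R (B)")
conn.executemany("INSERT INTO R VALUES (?, ?)", [(5, "x"), (7, "y")])

# The unique index rejects a second row with B = 'x'.
try:
    conn.execute("INSERT INTO R VALUES (9, 'x')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

conn.execute("DROP INDEX r_a_idx")
remaining = [row[0] for row in
             conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")]
print(duplicate_rejected, remaining)
```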

  33. Note: cannot specify the type of index (e.g. B-tree, hashing, …) or its parameters (e.g. load factor, size of hash, ...) ... at least in SQL...
