
Hash-Based Indexes






  1. Hash-Based Indexes
  Yanlei Diao, UMass Amherst, Feb 22, 2006
  Slides courtesy of R. Ramakrishnan and J. Gehrke

  2. Introduction
  • As for any index, 3 alternatives for data entries k*:
    • Data record with key value k
    • <k, rid of data record with search key value k>
    • <k, list of rids of data records with search key k>
  • Choice is orthogonal to the indexing technique.
  • Hash-based indexes are best for equality selections. Cannot support range searches.
  • Static and dynamic hashing techniques exist; trade-offs similar to ISAM vs. B+ trees.

  3. Static Hashing
  [Figure: h(key) mod N maps each key to one of buckets 0 .. N-1; each bucket has a primary bucket page plus overflow pages.]
  • h(k) mod N = bucket to which data entry with key k belongs. Distinct keys k1 ≠ k2 can lead to the same bucket.
  • Static: # buckets (N) is fixed;
    • primary pages allocated sequentially, never de-allocated;
    • overflow pages added if needed.

  4. Static Hashing (Contd.)
  • Hash fn works on the search key field of record r. Must distribute values over range 0 ... N-1.
  • h(key) mod N = (a * key + b) mod N usually works well.
    • a and b are constants; lots is known about how to tune h.
  • Buckets contain data entries.
  • Long overflow chains can develop and degrade performance. (See the sketch below.)
  • Extendible and Linear Hashing: dynamic techniques to fix this problem.
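A minimal in-memory sketch of static hashing as described above, in Python. The names (StaticHashIndex, CAPACITY) and the integer-key assumption are illustrative, not from the slides; each bucket is modeled as a chain of fixed-capacity pages, so a long overflow chain shows up as a long list of pages.

```python
CAPACITY = 4  # data entries per page (assumed for illustration)

class StaticHashIndex:
    """Static hashing: N primary buckets, overflow pages chained per bucket."""

    def __init__(self, n_buckets, a=7, b=3):
        self.n, self.a, self.b = n_buckets, a, b
        # bucket = list of pages; page = list of (key, rid) data entries
        self.buckets = [[[]] for _ in range(n_buckets)]

    def _h(self, key):
        return (self.a * key + self.b) % self.n  # h(key) mod N

    def insert(self, key, rid):
        pages = self.buckets[self._h(key)]
        if len(pages[-1]) == CAPACITY:  # last page full:
            pages.append([])            # chain a new overflow page
        pages[-1].append((key, rid))

    def search(self, key):
        # Equality search: scan the primary page and its overflow chain.
        return [rid for page in self.buckets[self._h(key)]
                for (k, rid) in page if k == key]
```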

  5. Extendible Hashing
  • When a bucket (primary page) becomes full, why not re-organize the file by doubling the # of buckets?
    • Reading and writing all pages is expensive!
  • Idea: use a directory of pointers to buckets, and double the # of buckets by (1) doubling the directory, (2) splitting just the bucket that overflowed!
    • Directory is much smaller than the file, so doubling it is much cheaper.
    • Only one page of data entries is split. No overflow page!
    • The trick lies in how the hash function is adjusted!

  6. Example
  [Figure: a directory (GLOBAL DEPTH D = 2) with entries 00, 01, 10, 11 pointing to data entry pages, each with LOCAL DEPTH L = 2: Bucket A: 4*, 12*, 32*, 16*; Bucket B: 1*, 5*, 21*, 13*; Bucket C: 10*; Bucket D: 15*, 7*, 19*.]
  • Directory is an array of size N = 4, global depth D = 2.
  • To find the bucket for a key:
    • compute h(key),
    • take the last `global depth' # of bits of h(key), i.e., h(key) mod 2^D.
  • If h(key) = 5 = binary 101:
    • take the last 2 bits, go to the bucket pointed to by directory entry 01.
  • Each bucket has a local depth L (L ≤ D), used for splits! (See the snippet below.)
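A one-function illustration of the lookup rule above; the helper name directory_slot is hypothetical. Taking the last D bits of the hashed key is the same as computing h(key) mod 2^D.

```python
def directory_slot(hashed_key: int, global_depth: int) -> int:
    # Last `global_depth` bits of h(key), i.e., h(key) mod 2**global_depth.
    return hashed_key & ((1 << global_depth) - 1)

# h(key) = 5 = binary 101, global depth 2 -> slot 01, as on the slide.
assert directory_slot(5, 2) == 0b01
```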

  7. Inserts
  [Figure: the same directory and Buckets A-D as on the previous slide.]
  • If the bucket is full, split it:
    • allocate a new page,
    • re-distribute the entries.
  • If necessary, double the directory. The decision is made by comparing global depth D and local depth L of the split bucket:
    • double the directory if D = L,
    • otherwise, don't.
  • Insert k* with h(k) = 20?

  8. Insert h(k) = 20 (Causes Doubling)
  [Figure: inserting 20* into full Bucket A forces a split and doubles the directory. After: GLOBAL DEPTH 3, directory entries 000-111; Bucket A (LOCAL DEPTH 3): 32*, 16*; Bucket B (2): 1*, 5*, 21*, 13*; Bucket C (2): 10*; Bucket D (2): 15*, 7*, 19*; Bucket A2 (3): 4*, 12*, 20* (`split image' of Bucket A). A companion diagram labels the same split buckets by their bit strings: A = (000), A2 = (100).]

  9. Points to Note
  • 20 = binary 10100. The last 2 bits (00) tell us that k* belongs in A or A2. The last 3 bits are needed to tell exactly which.
  • Global depth D of directory: max # of bits needed to tell which bucket an entry belongs to.
  • Local depth L of a bucket: # of bits actually needed to determine if an entry belongs to this bucket.
  • When does a bucket split cause directory doubling?
    • When, before the insert, L of the bucket = D of the directory.
    • Directory is doubled by copying it over and `fixing' the pointer to the split image page. (Use of least significant bits enables efficient doubling via copying of the directory! See the sketch below.)
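Putting slides 6-9 together, a minimal in-memory sketch of extendible hashing in Python. All names (ExtendibleHash, Bucket, CAPACITY) and the identity hash are illustrative assumptions; a real implementation stores pages on disk. Note that the directory doubles only when the overflowing bucket's local depth equals the global depth, and that doubling is just copying the pointer array.

```python
CAPACITY = 4  # entries per bucket page (assumed)

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.entries = []  # (key, rid) data entries

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]

    def _h(self, key):
        return key  # identity hash, for illustration only

    def _slot(self, key):
        # Last `global_depth` bits of h(key).
        return self._h(key) & ((1 << self.global_depth) - 1)

    def search(self, key):
        return [rid for k, rid in self.directory[self._slot(key)].entries
                if k == key]

    def insert(self, key, rid):
        bucket = self.directory[self._slot(key)]
        if len(bucket.entries) < CAPACITY:
            bucket.entries.append((key, rid))
            return
        if bucket.local_depth == self.global_depth:
            # Double the directory by copying it over (cheap: pointers only).
            self.directory += self.directory
            self.global_depth += 1
        # Split just the overflowed bucket on one more bit.
        bucket.local_depth += 1
        image = Bucket(bucket.local_depth)
        bit = 1 << (bucket.local_depth - 1)  # new distinguishing bit
        old = bucket.entries
        bucket.entries = [e for e in old if not self._h(e[0]) & bit]
        image.entries = [e for e in old if self._h(e[0]) & bit]
        # Fix directory pointers whose slot has the new bit set.
        for i, b in enumerate(self.directory):
            if b is bucket and i & bit:
                self.directory[i] = image
        # Retry; may split again on skew. (Duplicate keys that never
        # separate would need overflow pages, as slide 12 notes.)
        self.insert(key, rid)
```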

  10. Directory Doubling (inserting 8*)
  [Figure: two ways to double a 2-bit directory (00-11) to a 3-bit directory (000-111) when 8* = binary 1000 is inserted, shown side by side: one indexes buckets by the least significant bits of h(key), the other by the most significant bits. With least significant bits, the new directory is two back-to-back copies of the old one; with most significant bits it is not, which is why least significant bits allow doubling by simple copying.]

  11. Deletes
  • Delete: removal of a data entry from its bucket.
    • If the bucket becomes empty, it can be merged with its `split image'.
    • If each directory element points to the same bucket as its split image, the directory can be halved.
  • If we assume more inserts than deletes, do nothing...

  12. Comments on Extendible Hashing
  • Access cost: if the directory fits in memory, an equality search takes one I/O to access the bucket; else two.
    • 100 MB file, 100 bytes/record, 4K pages
    • 1,000,000 records (treated as data entries)
    • 40 data entries per page, 25,000 pages/buckets
    • 25,000 elements (pointers) in the directory; at 1K pointers per page, 25 pages for the directory, likely to fit in memory! (The arithmetic is spelled out below.)
  • Skew: if the distribution of hash values is skewed, the directory can grow large. An example?
  • Duplicates: entries with the same key value may need overflow pages!
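The slide's directory-sizing arithmetic, spelled out as a sanity check (round numbers as on the slide; the variable names are just for illustration):

```python
file_bytes   = 100 * 10**6   # 100 MB file (decimal, as on the slide)
record_bytes = 100
page_bytes   = 4096          # 4K pages

records   = file_bytes // record_bytes   # 1,000,000 data entries
per_page  = page_bytes // record_bytes   # 40 data entries per page
buckets   = records // per_page          # 25,000 pages/buckets
dir_pages = buckets // 1000              # ~25 directory pages at ~1K pointers/page
print(records, per_page, buckets, dir_pages)  # 1000000 40 25000 25
```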

  13. Linear Hashing
  • Another dynamic hashing scheme, an alternative to Extendible Hashing.
  • Advantage: adjusts to inserts/deletes without using a directory.
  • Idea: use a family of hash functions h0, h1, h2, ... (sketched below):
    • hi(key) = h(key) mod (2^i * N); N = initial # of buckets
    • h is some hash function (its range is not 0 to N-1)
    • h0(key) = h(key) mod N
    • hi+1 doubles the range of hi (similar to directory doubling)
    • If N = 2^d0 for some d0, then hi consists of applying h and looking at the last di bits, where di = d0 + i.
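A direct transcription of this hash-function family into Python; N = 4 and the use of Python's built-in hash as h are assumptions for illustration.

```python
N = 4  # initial number of buckets (assumed)

def h_i(key: int, i: int) -> int:
    # h_i(key) = h(key) mod (2^i * N); here h is Python's built-in hash.
    return hash(key) % ((2 ** i) * N)

# h_{i+1} doubles the range of h_i: a bucket b under h_i maps to
# either b or b + 2**i * N under h_{i+1}.
assert all(h_i(k, 1) in (h_i(k, 0), h_i(k, 0) + N) for k in range(100))
```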

  14. Linear Hashing (Contd.)
  • LH avoids a directory by
    • using temporary overflow pages, and
    • choosing the bucket to split round-robin.
  • Splitting proceeds in `rounds'.
    • Round L ends when all NL = 2^L * N initial buckets have been split.
    • Next = the next bucket to be split. Buckets 0 to Next-1 have been split; Next to NL are yet to be split.
  • Search: to find the bucket for data entry k*, compute hL(k):
    • If hL(k) is in the range `Next to NL', k* belongs there.
    • Else, k* could belong to bucket hL(k) or to bucket hL(k) + NL; must apply hL+1(k) to find out.

  15. LH File and Searches
  [Figure: the primary pages of an LH file during round L. Buckets 0 to Next-1 were split in this round; Next marks the bucket to be split; buckets up to 2^L * N existed at the beginning of the round (this is the range of hL); `split image' buckets, created by the splits of this round, follow at the end of the file.]
  • In the Lth round of splitting, search: if hL(key) is in the split range (below Next), must use hL+1 to decide whether the entry is in a split image bucket.

  16. Inserts
  • Insert: find bucket B by applying hL / hL+1:
    • If bucket B is full:
      • add an overflow page and insert data entry k*,
      • (maybe) split bucket Next (often B ≠ Next!) using hL+1 and increment Next.
  • Can choose any criterion to `trigger' a split.
  • Since buckets are split round-robin, long overflow chains don't develop! (See the sketch below.)
  • Extendible Hashing in comparison:
    • Doubling of the directory is similar.
    • Switching of hash functions is implicit in how the # of bits examined is increased. No need to physically double a directory!
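Putting slides 13-16 together, a minimal in-memory sketch of linear hashing in Python. Names (LinearHash, CAPACITY) and the identity base hash are assumptions; overflow chains are modeled simply by letting a bucket's list grow past CAPACITY, and a split is triggered whenever that happens (one possible criterion, as the slide notes).

```python
CAPACITY = 4  # primary page capacity (assumed)

class LinearHash:
    def __init__(self, n=4):
        self.n0 = n        # N: # of buckets at the start of round 0
        self.level = 0     # L: current round
        self.next = 0      # Next: next bucket to split
        self.buckets = [[] for _ in range(n)]

    def _h(self, key, i):
        return key % ((2 ** i) * self.n0)  # h_i, with identity base hash

    def _bucket_of(self, key):
        b = self._h(key, self.level)
        if b < self.next:                     # bucket already split this round:
            b = self._h(key, self.level + 1)  # h_{L+1} picks old bucket or image
        return b

    def search(self, key):
        return [rid for k, rid in self.buckets[self._bucket_of(key)]
                if k == key]

    def insert(self, key, rid):
        b = self._bucket_of(key)
        self.buckets[b].append((key, rid))   # growth past CAPACITY = overflow
        if len(self.buckets[b]) > CAPACITY:  # split trigger (one choice of many)
            self._split()

    def _split(self):
        # Split bucket Next (often NOT the bucket that overflowed!).
        n_l = (2 ** self.level) * self.n0
        old = self.buckets[self.next]
        self.buckets.append([])              # split image lives at next + n_l
        self.buckets[self.next] = [
            e for e in old if self._h(e[0], self.level + 1) == self.next]
        self.buckets[self.next + n_l] = [
            e for e in old if self._h(e[0], self.level + 1) != self.next]
        self.next += 1
        if self.next == n_l:                 # all N_L buckets split: round ends
            self.level += 1
            self.next = 0
```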

  17. Example of Linear Hashing
  [Figure, before (L = 0, N = 4, h0 = h mod 2^2, Next = 0), primary pages: bucket 00: 32*, 44*, 36*; 01: 9*, 5*, 25*; 10: 30*, 10*, 18*, 14*; 11: 31*, 35*, 7*, 11*.
  After inserting 43* (h(43) = 101011, h0 = 11): bucket 11 is full, so 43* goes to an overflow page, and bucket Next = 0 is split using h1 = h mod 2^3; Next becomes 1. Result: 000: 32*; 001: 9*, 5*, 25*; 010: 30*, 10*, 18*, 14*; 011: 31*, 35*, 7*, 11* plus overflow page 43*; 100 (split image): 44*, 36*.]
  • On a split, hL+1 is used to re-distribute entries.

  18. Example: End of a Round
  [Figure, before (round L = 0, h0 = h mod 2^2, h1 = h mod 2^3, Next = 3): inserting 50* (binary 110010). h0(50) = 10 < Next, so h1 places it in bucket 010, which is full: 50* goes to an overflow page, and bucket Next = 3 (11) is split. That was the last unsplit bucket, so the round ends.
  After (round L = 1, h1 = h mod 2^3, Next = 0): 000: 32*; 001: 9*, 25*; 010: 66*, 18*, 10*, 34* plus overflow page 50*; 011: 43*, 35*, 11*; 100: 44*, 36*; 101: 5*, 37*, 29*; 110: 14*, 30*, 22*; 111: 31*, 7*.]

  19. LH Described as a Variant of EH
  • The two schemes are actually quite similar:
    • Begin with an EH index where the directory has N elements.
    • Use overflow pages; split buckets round-robin.
    • The first split is at bucket 0. (Imagine the directory being doubled at this point.) But elements (1, N+1), (2, N+2), ... are the same, so we need only create directory element N, which now differs from element 0.
    • When bucket 1 splits, create directory element N+1, etc.
  • The directory can double gradually.
  • Primary bucket pages are created in order. If they are also allocated in sequence (so that finding the i'th is easy), we actually don't need a directory!

  20. Summary
  • Hash-based indexes: best for equality searches; cannot support range searches.
  • Static Hashing can lead to long overflow chains.
  • Extendible Hashing avoids overflow pages by splitting a full bucket when a new data entry is to be added to it. (But duplicates may require overflow pages.)
    • A directory keeps track of buckets and doubles periodically.
    • It can get large with skewed data; additional I/O if it does not fit in main memory.

  21. Summary (Contd.)
  • Linear Hashing avoids a directory by splitting buckets round-robin and using overflow pages.
    • Overflow pages are not likely to be long.
    • Duplicates are handled easily.
    • Space utilization could be lower than with Extendible Hashing, since splits are not concentrated on `dense' data areas.
      • Especially true with skewed data.
    • Can tune the criterion for triggering splits to trade off slightly longer chains for better space utilization.
  • For hash indexes, a skewed data distribution means the hash values of data entries are not uniformly distributed! This causes problems for both EH and LH.
