
Hashing Techniques



Presentation Transcript


  1. Hashing Techniques

  2. Overview • Hash[ “string key”] ==> integer value • Hash Table Data Structure : Use-case • To support insertion, deletion and search in average-case constant time • Assumption: Order of elements irrelevant • ==> data structure not useful for sorting • Hash table ADT • Implementations • Analysis • Applications

  3. Hash table: Main components • A hash function h maps a key (e.g., “john”) to a hash index in the range [0, TableSize-1] • The hash table itself is implemented as a vector, indexed by h(key) • How to determine the hash function? [Figure: “john” → hash function h(“john”) → index into the hash table]

  4. Hash Table [Figure: hash function maps a hash key to a data record] • Insert • T[h(“john”)] = T[3] = <“john”, 25000> • Delete • T[h(“john”)] = NULL • Search • Return T[h(“john”)] • What if h(“john”) = h(“joe”)? • ==> a “collision”

  5. Factors affecting Hash Table Design • Hash function • Table size • Usually fixed at the start • Collision handling scheme

  6. Hash Function Properties h(“key”) ==> hash table index • A hash function maps a key to an integer • Constraint: the integer should be in [0, TableSize-1] • A hash function can be a many-to-one mapping (causing collisions) • A collision occurs when the hash function maps two or more keys to the same array index • Collisions cannot be avoided, but their chances can be reduced using a good hash function

  7. Hash Function Properties h(“key”) ==> hash table index • A good hash function should have these properties: • Different keys should ideally map to different indices • The hash function should ideally distribute keys uniformly over the table • Minimize the chance of collisions

  8. Hash Function - Effective use of table size • Simple hash function (assume integer keys) • h(Key) = Key mod TableSize • For random keys, h() distributes keys evenly over table • What if TableSize = 100 and keys are multiples of 10? • Better if TableSize is a prime number • Not too close to powers of 2 or 10
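The table-size effect described above can be checked with a small sketch (a Python illustration written for this transcript; the helper `h` and the key set are assumed for the demo):

```python
# Why a prime TableSize helps h(key) = key % TableSize spread keys out.

def h(key, table_size):
    return key % table_size

keys = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]  # multiples of 10

# TableSize = 100: every key lands on a multiple of 10 -> only 10 of 100
# slots are ever usable, and all keys cluster on them
slots_100 = {h(k, 100) for k in keys}

# TableSize = 97 (prime): the same keys spread over distinct, scattered slots
slots_97 = {h(k, 97) for k in keys}

print(sorted(slots_100))  # [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
print(len(slots_97))      # 10 distinct indices, no fixed stride
```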

  9. Hash Function for String Keys A very simple function to map strings to integers: • Add up character ASCII values (0-127) to produce integer keys • E.g., “abcd” = 97+98+99+100 = 394 • ==> h(“abcd”) = 394 % TableSize • Small strings may not use all of table • Strlen(S) * 127 < TableSize • Anagrams will map to the same index • h(“abcd”) == h(“dbac”)
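The ASCII-sum scheme and its anagram weakness can be sketched as follows (a minimal illustration; `table_size` is an assumed parameter):

```python
# Simple string hash: sum character ASCII values, then reduce mod table size.
def ascii_sum_hash(s, table_size):
    return sum(ord(c) for c in s) % table_size

print(ascii_sum_hash("abcd", 1000))  # 97+98+99+100 = 394
print(ascii_sum_hash("dbac", 1000))  # also 394: the sum ignores order
```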

  10. Hash Function for String Keys [Figure: “Apple”, “Apply”, “Appointment”, “Apricot” all collide] • Approach 2 • Treat the first 3 characters of the string as a base-27 integer (26 letters plus space) • Key = S[0] + (27 * S[1]) + (27² * S[2]) • Assumes the first 3 characters are randomly distributed • Not true of English

  11. Hash Function for String Keys • Approach 3 • Use all N characters of the string as an N-digit base-K integer • Choose K to be a prime number larger than the number of different digits (characters) • E.g., K = 29, 31, 37 • If L = length of string S, then h(S) = (S[0]·K^(L−1) + S[1]·K^(L−2) + … + S[L−1]) mod TableSize • Use Horner’s rule to compute h(S) • Limit L for long strings • Problem: a very long string can evaluate to a very large number
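Approach 3 with Horner's rule might look like the sketch below (an assumed implementation; reducing mod TableSize at every step keeps the intermediate value small, which addresses the large-number problem, and `max_len` limits L for long strings):

```python
# Horner's-rule string hash with base K = 31 (one of the suggested primes).
def horner_hash(s, table_size, k=31, max_len=8):
    h = 0
    for c in s[:max_len]:                  # limit L for long strings
        h = (h * k + ord(c)) % table_size  # reduce each step: no huge ints
    return h

print(horner_hash("ab", 101))  # (97*31 + 98) mod 101 = 75
```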

  12. Resolving Collisions • What happens when h(k1) = h(k2)? • ==> collision ! • Collision resolution strategies • Separate Chaining (Open hashing) • Store colliding keys in a linked list at the same hash table index • Open addressing (Closed hashing) • Store colliding keys elsewhere in the table

  13. Separate Chaining Collision resolution approach #1

  14. Collision Resolution by Chaining • Hash table T is a vector of linked lists • Only singly-linked lists needed if memory is tight • Key k is stored in list at T[h(k)] • E.g., TableSize = 10 • h(k) = k mod 10 • Insert first 10 perfect squares Insertion sequence: { 0 1 4 9 16 25 36 49 64 81 }
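The chaining example above can be reproduced with a short sketch (Python lists stand in for the linked lists; the class shape is an assumption, the parameters match the slide):

```python
# Separate chaining: TableSize = 10, h(k) = k mod 10,
# inserting the first 10 perfect squares.

class ChainedHashTable:
    def __init__(self, size=10):
        self.table = [[] for _ in range(size)]  # vector of (linked) lists
        self.size = size

    def _h(self, key):
        return key % self.size

    def insert(self, key):
        chain = self.table[self._h(key)]
        if key not in chain:                    # avoid duplicates
            chain.append(key)

    def search(self, key):
        return key in self.table[self._h(key)]

    def delete(self, key):
        chain = self.table[self._h(key)]
        if key in chain:
            chain.remove(key)

t = ChainedHashTable()
for k in [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]:
    t.insert(k)

print(t.table[9])   # keys ending in 9 share a chain: [9, 49]
print(t.table[6])   # [16, 36]
```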

  15. Collision Resolution by Chaining: Analysis • Load factor λ of a hash table T • N = number of elements in T • M = size of T • λ = N/M • I.e., λ is the average length of a chain • Unsuccessful search: O(λ) • Same for insert • Successful search: O(λ/2) • Ideally, want λ ≤ 1 (not a function of N)

  16. Open Addressing Collision resolution approach #2

  17. Collision Resolution by Open Addressing When a collision occurs, look elsewhere in the table for an empty slot • Advantages over chaining • No need for list structures • No need to allocate/deallocate memory during insertion/deletion (slow) • Disadvantages • Slower insertion – may need several attempts to find an empty slot • Table needs to be bigger (than a chaining-based table) to achieve average-case constant-time performance • Load factor λ ≈ 0.5

  18. Collision Resolution by Open Addressing • Probe sequence • Sequence of slots in the hash table to search • h0(x), h1(x), h2(x), … • Needs to visit each slot exactly once • Needs to be repeatable (so we can find/delete what we’ve inserted) • Hash function • hi(x) = (h(x) + f(i)) mod TableSize • f(0) = 0 ==> first try

  19. Linear Probing • ith probe index = (first probe index + i) mod TableSize • f(i) is a linear function of i, e.g., f(i) = i • hi(x) = (h(x) + i) mod TableSize • Probe sequence: +0, +1, +2, +3, +4, … • Example: h(x) = x mod 10 • h0(89) = (h(89)+f(0)) mod 10 = 9 • h0(18) = (h(18)+f(0)) mod 10 = 8 • h0(49) = (h(49)+f(0)) mod 10 = 9 (X) • h1(49) = (h(49)+f(1)) mod 10 = (h(49)+1) mod 10 = 0
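A minimal linear-probing insert reproducing the trace above (the helper name is assumed; `None` marks an empty slot):

```python
# Linear probing: try (h(x) + i) mod TableSize for i = 0, 1, 2, ...
def linear_probe_insert(table, x):
    size = len(table)
    for i in range(size):
        idx = (x % size + i) % size
        if table[idx] is None:
            table[idx] = x
            return idx
    raise RuntimeError("table full")

table = [None] * 10
print(linear_probe_insert(table, 89))  # 9
print(linear_probe_insert(table, 18))  # 8
print(linear_probe_insert(table, 49))  # slot 9 taken -> wraps to 0
```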

  20. Linear Probing Example • Insert sequence: 89, 18, 49, 58, 69 • Number of unsuccessful probes per insertion: 0, 0, 1, 3, 3 (7 total)

  21. Linear Probing: Issues • Probe sequences can get long • Primary clustering • Keys tend to cluster in one part of table • Keys that hash into cluster will be added to the end of the cluster (making it even bigger)

  22. Quadratic Probing • Avoids primary clustering • f(i) is quadratic in i, e.g., f(i) = i² • hi(x) = (h(x) + i²) mod TableSize • Probe sequence: +0, +1, +4, +9, +16, … • Example: • h0(58) = (h(58)+f(0)) mod 10 = 8 (X) • h1(58) = (h(58)+f(1)) mod 10 = 9 (X) • h2(58) = (h(58)+f(2)) mod 10 = 2
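The quadratic scheme differs from the linear one only in the offset; a sketch matching the slide's 58 example (helper name assumed):

```python
# Quadratic probing: try (h(x) + i*i) mod TableSize for i = 0, 1, 2, ...
def quadratic_probe_insert(table, x):
    size = len(table)
    for i in range(size):
        idx = (x % size + i * i) % size   # probe offsets +0, +1, +4, +9, ...
        if table[idx] is None:
            table[idx] = x
            return idx
    raise RuntimeError("no empty slot found")

table = [None] * 10
for key in (89, 18, 49):
    quadratic_probe_insert(table, key)
print(quadratic_probe_insert(table, 58))  # 8, 9 taken -> (8+4) mod 10 = 2
```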

  23. Quadratic Probing Example • Insert sequence: 89, 18, 49, 58, 69 • Probe offsets used: +0², then +1² and +2² on collisions • Number of unsuccessful probes per insertion: 0, 0, 1, 2, 2 (5 total) • Q) Delete(49), then Find(69) – is there a problem?

  24. Quadratic Probing • May cause “secondary clustering” • Deletion • Emptying slots can break a probe sequence and could cause a find to stop prematurely • Lazy deletion • Differentiate between empty and deleted slots • Skip deleted slots • Slows operations (effectively increases λ)
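Lazy deletion can be sketched as below (an assumed design illustrating the idea, using linear probing for simplicity: a deleted slot is marked rather than emptied, so probe sequences stay intact):

```python
# Lazy deletion: DELETED keeps probe chains alive; EMPTY terminates a search.
EMPTY, DELETED = object(), object()

def lp_find(table, x):
    size = len(table)
    for i in range(size):
        idx = (x % size + i) % size
        if table[idx] is EMPTY:   # truly empty: x cannot be further along
            return None
        if table[idx] == x:       # DELETED slots are skipped implicitly
            return idx
    return None

def lp_delete(table, x):
    idx = lp_find(table, x)
    if idx is not None:
        table[idx] = DELETED      # mark, don't empty

table = [EMPTY] * 10
# 89 and 49 both hash to slot 9; 49 was displaced to slot 0 by linear probing
table[9], table[0] = 89, 49
lp_delete(table, 89)
print(lp_find(table, 49))  # still found at 0; an EMPTY at 9 would have hidden it
```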

  25. Double Hashing • Use a second hash function for the probe step • f(i) = i * h2(x) • Good choices for h2(x)? • Should never evaluate to 0 • h2(x) = R – (x mod R) • R is a prime number less than TableSize • Previous example with R = 7 • h0(49) = (h(49)+f(0)) mod 10 = 9 (X) • h1(49) = (h(49)+1*(7 – 49 mod 7)) mod 10 = 6
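The double-hashing probe from the slide can be sketched directly (R = 7 and TableSize = 10 as in the example; function names are assumed):

```python
# Double hashing: f(i) = i * h2(x), with h2(x) = R - (x mod R), R = 7.
R = 7

def h2(x):
    return R - (x % R)            # always in 1..R, never 0

def double_hash_probe(x, i, table_size=10):
    return (x % table_size + i * h2(x)) % table_size

print(double_hash_probe(49, 0))   # 9, which collides with 89 in the example
print(double_hash_probe(49, 1))   # (9 + 1*7) mod 10 = 6
```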

  26. Double Hashing Example

  27. Probing Techniques – review • Linear probing: ith try at offset +i • Quadratic probing: ith try at offset +i² • Double hashing: ith try at offset +i·h2(x) (determined by a second hash function)

  28. Rehashing • Increases the size of the hash table when load factor too high • Typically expand the table to twice its size (but still prime) • Need to reinsert all existing elements into new hash table • When to rehash • When table is half full (λ = 0.5) • When an insertion fails • When load factor reaches some threshold
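The expand-and-reinsert step can be sketched as follows (`next_prime` is a naive helper written for this illustration; linear probing is assumed for the reinsertion):

```python
# Rehashing: grow to the next prime after double the size, reinsert all keys.
def next_prime(n):
    def is_prime(m):
        return m > 1 and all(m % d for d in range(2, int(m ** 0.5) + 1))
    while not is_prime(n):
        n += 1
    return n

def rehash(table):
    new_size = next_prime(2 * len(table))
    new_table = [None] * new_size
    for key in table:
        if key is not None:               # reinsert with linear probing
            i = key % new_size
            while new_table[i] is not None:
                i = (i + 1) % new_size
            new_table[i] = key
    return new_table

table = [None] * 7
for k in (13, 20, 23):                    # λ = 3/7 ≈ 0.43 before rehash
    i = k % 7
    while table[i] is not None:
        i = (i + 1) % 7
    table[i] = k
table = rehash(table)
print(len(table))  # 17, the next prime after 14; λ drops to 3/17 ≈ 0.18
```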

  29. Rehashing Example • Before: h(x) = x mod 7, λ = 0.57 • Insert 23: λ = 0.71 ==> rehash • After rehashing: h(x) = x mod 17, λ = 0.29

  30. Hash Table Applications • Symbol table in compilers • Accessing tree or graph nodes by name • E.g., city names in Google maps • Maintaining a transposition table in games • Remember previous game situations and the move taken (avoid re-computation) • Dictionary lookups • Spelling checkers • Natural language understanding (word sense) • Heavily used in text processing languages • E.g., Perl, Python, etc.

  31. Summary • Hash tables support fast insert and search • O(1) average case performance • Deletion possible, but degrades performance • Not suited if ordering of elements is important • Many applications

  32. Hashing Problem • Draw the 11-entry hash table for hashing the keys 12, 44, 13, 88, 23, 94, 11, 39, 20 using the function (2i+5) mod 11, closed hashing, linear probing • List all identifiers in a hash table in lexicographic order, using open hashing and the hash function h(x) = first character of x. What is the running time?

  33. End of Hashing

  34. Hashing Analysis

  35. Random Probing: Analysis • Random probing does not suffer from clustering • Expected number of probes per insertion, averaged over filling the table to load factor λ: (1/λ) ln(1/(1−λ)) • Example • λ = 0.5: 1.4 probes • λ = 0.9: 2.6 probes

  36. Linear vs. Random Probing • [Figure: number of probes vs. load factor λ for unsuccessful search (U), successful search (S), and insert (I); random probing stays low (good) while linear probing degrades sharply (bad) as λ grows]

  37. Quadratic Probing: Analysis • Difficult to analyze • Theorem 5.1 • A new element can always be inserted into a table that is at least half empty and whose TableSize is prime • Otherwise, may never find an empty slot, even if one exists • Ensure the table never gets half full • If close, then expand it

  38. Double Hashing: Analysis • Imperative that TableSize is prime • E.g., insert 23 into previous table • Empirical tests show double hashing close to random hashing • Extra hash function takes extra time to compute

  39. Rehashing Analysis • Rehashing takes O(N) time • But happens infrequently • Specifically • Must have been N/2 insertions since last rehash • Amortizing the O(N) cost over the N/2 prior insertions yields only constant additional time per insertion

  40. Problem with Large Tables • What if hash table is too large to store in main memory? • Solution: Store hash table on disk • Minimize disk accesses • But… • Collisions require disk accesses • Rehashing requires a lot of disk accesses Solution: Extendible Hashing

  41. Extendible Hashing • Extendible hashing is a type of hash system which treats a hash as a bit string and uses a prefix for bucket lookup. Because of the hierarchical nature of the system, re-hashing is an incremental operation (done one bucket at a time, as needed). • Place the following keys in the hash table: k1 = 100100, k2 = 010110, k3 = 110110, k4 = 011110 • Let's assume that for this particular example, the bucket size is 1. The first two keys to be inserted, k1 and k2, can be distinguished by the most significant bit, and would be inserted into the table as follows:

  42. Extendible Hashing • Now, if k3 were to be hashed to the table, one bit would no longer be enough to distinguish all three keys (because k3 and k1 both have 1 as their leftmost bit). Also, because the bucket size is one, the table would overflow. Because comparing the first two most significant bits would give each key a unique location, the directory size is doubled as follows:

  43. Extendible Hashing • Now, k4 needs to be inserted, and its first two bits are 01 (011110); using a 2-bit depth in the directory, 01 maps to Bucket A. Bucket A is full (max size 1), so it must be split; because there is more than one pointer to Bucket A, there is no need to increase the directory size. • Two pieces of information are needed: • The key size that indexes the directory (the global depth), and • The key size that previously mapped the bucket (the local depth) • These distinguish the two action cases: • Doubling the directory when a bucket becomes full • Creating a new bucket and redistributing the entries between the old and the new bucket • Specifically: • If the local depth equals the global depth, then only one directory pointer maps to the bucket, and no other pointer can, so the directory must be doubled (case 1). • If the bucket is full and the local depth is less than the global depth, then more than one directory pointer maps to the bucket, and the bucket can be split without doubling the directory (case 2).

  44. Extendible Hashing • Key 01 points to Bucket A, and Bucket A's local depth of 1 is less than the directory's global depth of 2, which means keys hashed to Bucket A have only used a 1-bit prefix (i.e., 0). The bucket's contents must therefore be redistributed using keys 1 + 1 = 2 bits in length. In general, for any local depth d less than the global depth D, d is incremented after a bucket split, and the new d is used as the number of bits of each entry's key to redistribute the entries of the former bucket into the new buckets.

  45. Extendible Hashing • Now, h(k4) = 011110 is tried again with 2 bits, 01, and key 01 now points to a new bucket, but k2 is still in it (h(k2) = 010110 also begins with 01). So Bucket D needs to be split, but its local depth (2) equals the global depth (2), so the directory must be doubled again to hold keys of sufficient detail, e.g., 3 bits.

  46. Extendible Hashing • Bucket D needs to split because it is full. Since D's local depth equals the global depth, the directory must double to increase the bit detail of keys; the global depth becomes 3 after the directory split. The new entry k4 is re-keyed with 3 global-depth bits and still lands in D, whose local depth of 2 can now be incremented to 3, allowing D to be split into D' and E. The contents of the split bucket D, namely k2, are re-keyed with 3 bits and end up in D'; k4 is retried and ends up in E, which has a spare slot.

  47. Extendible Hashing • Now h(k2) = 010110 is in D', and h(k4) = 011110 is tried again with 3 bits, 011; it points to bucket D, which already contains k2 and so is full. D's local depth is 2, but the global depth is now 3 after the directory doubling, so D can be split into buckets D' and E. The content of D, k2, has h(k2) retried with the new 3-bit global-depth mask and ends up in D'; then the new entry k4 is retried with h(k4) masked to 3 bits, giving 011, which now points to the new, empty bucket E. So k4 goes in Bucket E.
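The whole k1–k4 trace can be followed end-to-end with a compact sketch (bucket size 1, keys as 6-bit strings; the class layout and function names are assumptions, but the two action cases are exactly the ones described above):

```python
# Extendible hashing sketch: directory indexed by the global-depth bit prefix.
class Bucket:
    def __init__(self, depth):
        self.depth = depth                      # local depth
        self.keys = []

CAPACITY = 1                                    # bucket size 1, as in the example

def eh_insert(directory, gd, key):
    """Insert a bit-string key; returns the (possibly doubled) directory and global depth."""
    while True:
        idx = int(key[:gd], 2) if gd else 0     # index by gd-bit prefix
        b = directory[idx]
        if len(b.keys) < CAPACITY:
            b.keys.append(key)
            return directory, gd
        if b.depth == gd:                       # case 1: single pointer -> double directory
            directory = [directory[i >> 1] for i in range(2 * len(directory))]
            gd += 1
        # case 2 (or after doubling): split the full bucket on one more prefix bit
        b0, b1 = Bucket(b.depth + 1), Bucket(b.depth + 1)
        for k in b.keys:
            (b1 if int(k[:b.depth + 1], 2) & 1 else b0).keys.append(k)
        for i in range(len(directory)):         # repoint every entry that used b
            if directory[i] is b:
                directory[i] = b1 if (i >> (gd - b.depth - 1)) & 1 else b0

directory, gd = [Bucket(0)], 0
for key in ("100100", "010110", "110110", "011110"):   # k1, k2, k3, k4
    directory, gd = eh_insert(directory, gd, key)

print(gd)                         # 3: 3-bit prefixes are needed in the end
print(directory[0b011].keys)      # k4 alone in its bucket (E in the slides)
```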
