Hash Functions Andy Wang Data Structures, Algorithms, and Generic Programming
Introduction • Hash function • Maps keys to integers (buckets) • Hash(Key) = Integer • Ideally in a random-like manner • Evenly distributed bucket values • Even if the input data is not evenly distributed
An Example • ID Number Generation • Key = your name • Hash(Key) = a number • Not a great hash function… • Two people with the same name will have the same number…
Simple Hash Functions • Assumptions: • K: an unsigned 32-bit integer • M: the number of buckets (the number of entries in a hash table) • Goal: • If a bit is changed in K, all bits are equally likely to change for Hash(K)
A Simple Hash Function… • What if K = M? • Hash(K) = K • What is wrong? • Your student ID = SSN • I can’t use your SSN to post your grades…
Another Simple Function • If K > M • Hash(K) = K % M • What is wrong? • Suppose M = 4, K = 2, 4, 6, 8 • K % M = 2, 0, 2, 0
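The clustering in this example can be checked with a short sketch. The names hash_mod and buckets_used are illustrative and not from the slides, and the fixed 64-bucket limit is an assumption made only to keep the example small.

```c
#include <stddef.h>

/* Plain modulo hash: maps key k to one of m buckets (m must be nonzero). */
unsigned int hash_mod(unsigned int k, unsigned int m)
{
    return k % m;
}

/* Counts how many of the m buckets receive at least one of the n keys.
   Assumption for this sketch: 1 <= m <= 64; otherwise 0 is returned. */
unsigned int buckets_used(const unsigned int *keys, size_t n, unsigned int m)
{
    unsigned char used[64] = {0};
    unsigned int count = 0;

    if (m == 0 || m > 64) {
        return 0;
    }
    for (size_t i = 0; i < n; i++) {
        unsigned int b = hash_mod(keys[i], m);
        if (!used[b]) {
            used[b] = 1;
            count++;
        }
    }
    return count;
}
```

With keys 2, 4, 6, 8 and M = 4, only buckets 0 and 2 are ever used, so half of the table stays empty.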
Yet Another Simple Function • If K > P, P = prime number • Hash(K) = K % P • Suppose P = 3, K = 2, 4, 6, 8 • K % P = 2, 1, 0, 2 • More uniform distribution…but still problematic for other cases
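A minimal sketch of the prime-modulus hash, using the same keys as the slide. The function name hash_prime is hypothetical, and the caller is assumed to pass a prime P.

```c
/* Modulo hash with a prime modulus p. The function does not verify that
   p is prime; that is left to the caller in this sketch. */
unsigned int hash_prime(unsigned int k, unsigned int p)
{
    return k % p;   /* result is always in 0 .. p-1 */
}
```

For P = 3 the keys 2, 4, 6, 8 land in buckets 2, 1, 0, 2, so all three buckets are used even though every key is even.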
More on Prime Numbers • K > P1 > P2, P1 and P2 are prime numbers • Hash(K) = (K % P1) % P2 • Suppose P1 = 5, P2 = 3, K = 2, 4, 6, 8, 10 • (K % 5) = 2, 4, 1, 3, 0 • (K % 5) % 3 = 2, 1, 1, 0, 0 • Still uniform distribution
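The two-stage reduction can be sketched as follows. The name hash_two_primes is illustrative; both moduli are assumed to be nonzero primes with P1 > P2, as on the slide.

```c
/* Reduces the key modulo p1 first, then modulo p2.
   Assumes p1 > p2 > 0; primality is not checked here. */
unsigned int hash_two_primes(unsigned int k, unsigned int p1, unsigned int p2)
{
    return (k % p1) % p2;
}
```

With P1 = 5 and P2 = 3, the keys 2, 4, 6, 8, 10 map to 2, 1, 1, 0, 0, matching the slide's table.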
Polynomial Functions • If K > P, P = prime number • Hash(K) = K(K + 3) % P • Slightly better than pure modulo functions
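One practical detail the formula hides: for a 32-bit K, the product K(K + 3) can overflow even a 64-bit integer. The sketch below reduces each factor modulo P before multiplying, which gives the same result as the exact formula. The name hash_poly is hypothetical.

```c
#include <stdint.h>

/* Computes K(K + 3) mod p without overflow.
   Each factor is reduced mod p first, so the product of the two
   residues is below 2^64 and fits in a uint64_t. p must be nonzero. */
uint32_t hash_poly(uint32_t k, uint32_t p)
{
    uint64_t a = (uint64_t)k % p;
    uint64_t b = ((uint64_t)k + 3u) % p;   /* k + 3 computed in 64 bits */
    return (uint32_t)((a * b) % p);
}
```

For example, with P = 7, K = 2 gives 2 · 5 = 10, and 10 % 7 = 3.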
How About… • Hash(K) = rand() • What is wrong? • Not repeatable
How About… • K > P, P = prime number • Hash(K) = rand(K) % P • Better randomness • Can be expensive to compute random numbers
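One way to read rand(K) is "a pseudo-random generator seeded with K". The sketch below makes that concrete with a single step of a linear congruential generator. The choice of generator and its constants (the commonly published Numerical Recipes parameters) is an assumption of this example, not something the slides specify, and hash_seeded is a hypothetical name.

```c
#include <stdint.h>

/* Uses the key as the seed of a linear congruential generator and
   takes one step. The same key always yields the same value, which
   fixes the repeatability problem of Hash(K) = rand().
   p must be nonzero. */
uint32_t hash_seeded(uint32_t k, uint32_t p)
{
    uint32_t state = k * 1664525u + 1013904223u;  /* wraps modulo 2^32 */
    return (state >> 16) % p;   /* use the better-mixed high bits */
}
```

A real generator usually costs more than one multiply and add per key, which is the expense the slide warns about.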
Pre-generated Randomness • Two prime numbers: P1 and P2 • K > P1 and K > P2 • A table R[P1], with R[i] pre-initialized to rand(i) % P2 • Hash(K) = R[K % P1] • Slight Problem: Possible duplicate mapping
To Avoid Duplicate Mapping… • Two prime numbers: P1 and P2 • K > P1 and K > P2 • A table R[P1], with R[i] pre-initialized to unique random numbers • Hash(K) = R[K % P1]
An Example • K = 0…2^32, P1 = 3, P2 = 5 • R[3] = {0, 4, 1} • Hash(K) = R[K % 3]
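A table of unique random entries can be built with a partial Fisher-Yates shuffle, sketched below for P1 = 3 and P2 = 5. The names init_table and hash_lookup are illustrative, and using the C library's srand/rand as the random source is an assumption of this sketch.

```c
#include <stdlib.h>

enum { P1 = 3, P2 = 5 };

static unsigned int R[P1];   /* lookup table, filled by init_table() */

/* Fills R with P1 distinct values drawn from 0 .. P2-1.
   A partial Fisher-Yates shuffle guarantees there are no duplicates. */
void init_table(unsigned int seed)
{
    unsigned int pool[P2];

    for (unsigned int i = 0; i < P2; i++) {
        pool[i] = i;
    }
    srand(seed);
    for (unsigned int i = 0; i < P1; i++) {
        /* pick a random position in the not-yet-used part of the pool */
        unsigned int j = i + (unsigned int)rand() % (P2 - i);
        unsigned int tmp = pool[i];
        pool[i] = pool[j];
        pool[j] = tmp;
        R[i] = pool[i];
    }
}

/* Hash(K) = R[K % P1]; init_table() must have been called first. */
unsigned int hash_lookup(unsigned int k)
{
    return R[k % P1];
}
```

The table costs one random draw per entry at setup time; each hash afterwards is a single modulo and an array read.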
Hashing a Sequence of Keys • K = {K1, K2, …, Kn} • E.g., Hash("test") = 98157 • Design Principles • Use the entire key • Use the ordering information • Use pre-generated randomness
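Use the Entire Key
unsigned int Hash(const char *Key) {
    unsigned int hash = 0;
    /* XOR every character of the NUL-terminated key into the hash */
    for (unsigned int j = 0; Key[j] != '\0'; j++) {
        hash = hash ^ (unsigned char)Key[j];
    }
    return hash;
}
• Problem: Hash("ab") == Hash("ba"), because XOR ignores the order of the characters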
Use the Ordering Information
unsigned int Hash(const char *Key) {
    unsigned int hash = 0;
    for (unsigned int j = 0; Key[j] != '\0'; j++) {
        hash = hash ^ (unsigned char)Key[j];
        hash = /* hash with some shiftings */;
    }
    return hash;
}
• Problem: H(short keys) will not perturb all 32 bits (clustering)
Use Pre-generated Randomness
unsigned int Hash(const char *Key) {
    unsigned int hash = 0;
    for (unsigned int j = 0; Key[j] != '\0'; j++) {
        /* R is a table of precomputed random numbers, indexed by character */
        hash = hash ^ R[(unsigned char)Key[j]];
        hash = /* hash with some shiftings */;
    }
    return hash;
}
CRC Variant • Do 5-bit circular shift of hash • XOR hash and K[j]
… for (…) {
    highorder = hash & 0xf8000000;    /* save the top 5 bits */
    hash = hash << 5;
    hash = hash ^ (highorder >> 27);  /* wrap them around to the low end */
    hash = hash ^ K[j];
} …
CRC Variant + For long keys, all 32-bits are exercised + More randomness toward lower bits - Not all bits are changed for short keys
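The loop fragment for the CRC variant can be completed into a full function as sketched below. The function name, the NUL-terminated string key, and the cast to unsigned char are assumptions added for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* CRC-variant string hash: rotate the 32-bit hash left by 5 bits,
   then XOR in the next character of the key. */
uint32_t crc_variant_hash(const char *key)
{
    uint32_t hash = 0;

    for (size_t j = 0; key[j] != '\0'; j++) {
        uint32_t highorder = hash & 0xf8000000u;  /* top 5 bits */
        hash = hash << 5;
        hash = hash ^ (highorder >> 27);          /* wrap to the low end */
        hash = hash ^ (unsigned char)key[j];
    }
    return hash;
}
```

Unlike the plain XOR hash, this distinguishes "ab" from "ba". A one-character key still touches only the low 8 bits, which is the short-key weakness listed on the slide.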
BUZ Hash • Set up an array R to store precomputed random numbers
… for (…) {
    highorder = hash & 0x80000000;    /* save the top bit */
    hash = hash << 1;
    hash = hash ^ (highorder >> 31);  /* 1-bit circular shift */
    hash = hash ^ R[K[j]];
} …
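A self-contained sketch of BUZ hash follows. The slides do not say how R is filled, so the xorshift32 generator used here is an assumption, as are the names buz_init and buz_hash and the 256-entry table (one entry per possible byte value).

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t R[256];   /* one precomputed random word per byte value */

/* Fills R using a xorshift32 generator (an assumed choice of generator).
   The generator state must be nonzero, so a zero seed is replaced by 1. */
void buz_init(uint32_t seed)
{
    uint32_t x = seed ? seed : 1u;

    for (int i = 0; i < 256; i++) {
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        R[i] = x;
    }
}

/* BUZ hash: rotate the hash left by 1 bit, then XOR in the random
   word selected by the next character. Call buz_init() first. */
uint32_t buz_hash(const char *key)
{
    uint32_t hash = 0;

    for (size_t j = 0; key[j] != '\0'; j++) {
        uint32_t highorder = hash & 0x80000000u;  /* top bit */
        hash = hash << 1;
        hash = hash ^ (highorder >> 31);          /* wrap to bit 0 */
        hash = hash ^ R[(unsigned char)key[j]];
    }
    return hash;
}
```

Because every character contributes a full 32-bit random word, even a one-character key can affect all bits of the result.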
References • Aho, Sethi, and Ullman. Compilers: Principles, Techniques, and Tools, 1986. • Cormen, Leiserson, and Rivest. Introduction to Algorithms, 1990. • Knuth. The Art of Computer Programming, 1973. • Kuenning. Hash Functions, 2003.