HashTable

Dictionary
• A collection of data that is accessed by “key” values
  • The keys may be ordered or unordered
  • Duplicate keys may or may not be allowed
• Supports the following fundamental methods
  • void put(Object key, Object data)
    • Inserts data into the dictionary using the specified key
  • Object get(Object key)
    • Returns the data associated with the specified key
    • An error occurs if the specified key is not in the dictionary
  • Object remove(Object key)
    • Removes the data associated with the specified key and returns the data
    • An error occurs if the specified key is not in the dictionary

Abstract Dictionary Example

Operation    Output    Dictionary
put(5, A)    None      ((5,A))
put(7, B)    None      ((5,A), (7,B))
put(2, C)    None      ((5,A), (7,B), (2,C))
get(A)       Error     ((5,A), (7,B), (2,C))
get(7)       B         ((5,A), (7,B), (2,C))
put(2, Q)    None      ((5,A), (7,B), (2,C), (2,Q))
get(2)       C or Q    ((5,A), (7,B), (2,C), (2,Q))
remove(Q)    Error     ((5,A), (7,B), (2,C), (2,Q))
remove(2)    C or Q    ((5,A), (7,B), (2,Q)) or ((5,A), (7,B), (2,C))
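The dictionary operations above can be sketched in Java. This is a minimal illustration, not the course's implementation: the names `SimpleDictionary` and `ListDictionary` are hypothetical, generics stand in for `Object`, duplicate keys are allowed, and `get`/`remove` act on the first matching entry (so `get(2)` yields C rather than Q).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical interface mirroring the three fundamental methods. */
interface SimpleDictionary<K, V> {
    void put(K key, V data);
    V get(K key);     // throws NoSuchElementException if the key is absent
    V remove(K key);  // throws NoSuchElementException if the key is absent
}

/** Unordered, list-based dictionary that allows duplicate keys. */
class ListDictionary<K, V> implements SimpleDictionary<K, V> {
    private final List<K> keys = new ArrayList<>();
    private final List<V> values = new ArrayList<>();

    @Override
    public void put(K key, V data) {
        keys.add(key);
        values.add(data);
    }

    @Override
    public V get(K key) {
        int i = keys.indexOf(key);  // first entry with this key
        if (i < 0) throw new NoSuchElementException("no such key: " + key);
        return values.get(i);
    }

    @Override
    public V remove(K key) {
        int i = keys.indexOf(key);
        if (i < 0) throw new NoSuchElementException("no such key: " + key);
        keys.remove(i);             // i is an int, so this removes by index
        return values.remove(i);
    }

    public int size() {
        return keys.size();
    }

    public static void main(String[] args) {
        ListDictionary<Integer, String> d = new ListDictionary<>();
        d.put(5, "A");
        d.put(7, "B");
        d.put(2, "C");
        d.put(2, "Q");
        System.out.println(d.get(7));
        System.out.println(d.remove(2));
    }
}
```

Both operations cost O(N) here, which is the motivation for the hashtable that follows.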

What is a Hashtable?
• A hashtable is an unordered dictionary that uses an array to store data
• Each data element is associated with a key
• Each key is mapped into an array index using a hash function
• The key AND the data are then stored in the array
• Hashtables are commonly used in the construction of compiler symbol tables

Dictionaries: AVL Trees vs. Hashtables

Method    AVL worst    AVL average    Hashtable worst    Hashtable average
put       O(log N)     O(log N)       O(N)               O(1)
get       O(log N)     O(log N)       O(N)               O(1)
remove    O(log N)     O(log N)       O(N)               O(1)

• AVL: not bad
• Hashtable average case: astounding!

Simple Example
• Insert data into the hashtable using characters as keys
• The hashtable is an array of “items” with slots 0 through 6
• The hashtable’s capacity is 7
• The hash function must take a character as input and convert it into a number between 0 and 6
• Use the following hash function:
  • Let P be the position of the character in the English alphabet (starting with 1)
  • The hash function is h(K) = P
  • The function must be normalized in order to map into the appropriate range (0-6)
  • The normalized hash function is h(K) % 7
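The character hash just described can be sketched as follows. The class name `LetterHash` is hypothetical, and the code assumes keys are ASCII letters A-Z (lowercase input is folded to uppercase).

```java
class LetterHash {
    static final int CAPACITY = 7;

    /** h(K) = P: the 1-based position of the letter in the English alphabet. */
    static int h(char key) {
        return Character.toUpperCase(key) - 'A' + 1;
    }

    /** Normalized hash: maps h(K) into the slot range 0..CAPACITY-1. */
    static int index(char key) {
        return h(key) % CAPACITY;
    }

    public static void main(String[] args) {
        for (char c : "BSJNXW".toCharArray()) {
            System.out.println(c + ": h = " + h(c) + ", slot = " + index(c));
        }
    }
}
```

Running it shows that J and X land in the same slot, as do B and W, which is the situation the next example explores.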

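Example
• Operations: put(B2, Data1), put(S19, Data2), put(J10, Data3), put(N14, Data4), put(X24, Data5), put(W23, Data6), put(B2, Data7), get(X24), get(W23)
• Each key is written as a letter followed by its alphabet position, so B2 maps to slot 2 % 7 = 2

Slot    Contents
0       (N14, Data4)
1
2       (B2, Data1)
3       (J10, Data3)    put(X24, Data5) also maps here: ???
4
5       (S19, Data2)
6

• This is called a collision
• Collisions are handled via a “collision resolution policy”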

From Keys to Indices • The mapping of keys to indices of a hash table is called a hash function • A hash function is usually the composition of two maps, a hash code map and a compression map. • An essential requirement of the hash function is to map equal keys to equal indices • A “good” hash function minimizes the probability of collisions

Popular Hash-Code Maps
• Integer cast: for numeric types with 32 bits or less, we can reinterpret the bits of the number as an int
• Component sum: for numeric types with more than 32 bits (e.g., long and double), we can add the 32-bit components
• Polynomial accumulation: for strings of a natural language, combine the character values (ASCII or Unicode) $a_0 a_1 \ldots a_{n-1}$ by viewing them as the coefficients of a polynomial: $a_0 + a_1 x + \cdots + a_{n-1} x^{n-1}$
  • The polynomial is computed with Horner’s rule, ignoring overflows, at a fixed value x: $a_0 + x(a_1 + x(a_2 + \cdots + x(a_{n-2} + x\,a_{n-1}) \cdots))$
  • The choice x = 33, 37, 39, or 41 gives at most 6 collisions on a vocabulary of 50,000 English words
• Why is the component-sum hash code bad for strings?
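A short sketch of polynomial accumulation with Horner's rule, set beside a component-sum hash for comparison. The names `PolynomialHash`, `polynomial`, and `componentSum` are hypothetical. One likely answer to the closing question is visible in the demo: a sum ignores character order, so anagrams such as "stop" and "pots" always collide.

```java
class PolynomialHash {
    /**
     * Evaluates a0 + x(a1 + x(a2 + ... + x*a(n-1))) with Horner's rule,
     * working from the last character back to the first.
     * int overflow wraps silently, which is the "ignoring overflows" step.
     */
    static int polynomial(String s, int x) {
        int h = 0;
        for (int i = s.length() - 1; i >= 0; i--) {
            h = h * x + s.charAt(i);
        }
        return h;
    }

    /** Component sum: adds the character values, so order is lost. */
    static int componentSum(String s) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            sum += s.charAt(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(componentSum("stop") == componentSum("pots"));
        System.out.println(polynomial("stop", 33) == polynomial("pots", 33));
    }
}
```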

Popular Compression Maps
• Division: h(k) = |k| mod N
  • the choice N = 2^k is bad because not all the bits are taken into account
  • the table size N is usually chosen as a prime number
  • certain patterns in the hash codes are propagated
• Multiply, Add, and Divide (MAD): h(k) = |ak + b| mod N
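Both compression maps can be sketched as below. The class and method names are hypothetical, and widening to `long` before taking the absolute value is an implementation choice made here (it avoids the `Math.abs(Integer.MIN_VALUE)` corner case), not something the slide specifies.

```java
class CompressionMaps {
    /** Division method: |k| mod n. */
    static int division(int k, int n) {
        return (int) (Math.abs((long) k) % n);
    }

    /** MAD method: |a*k + b| mod n, computed in long to avoid int overflow. */
    static int mad(int a, int b, int k, int n) {
        return (int) (Math.abs((long) a * k + b) % n);
    }

    public static void main(String[] args) {
        // Hash codes that are multiples of 8 all collide when N = 8 = 2^3,
        // because only the low 3 bits are used. A prime N spreads them out.
        for (int k = 8; k <= 32; k += 8) {
            System.out.println(k + " -> N=8: " + division(k, 8)
                    + ", N=7: " + division(k, 7));
        }
    }
}
```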

Details and Definitions
• Various means of “collision resolution” can be used. The collision resolution policy determines what is done when two keys map to the same array index.
  • Open Addressing: look for an open slot
  • Separate Chaining: keep a list of key/value pairs in a slot
• Load factor λ is the number of entries stored in the table divided by the capacity of the table

Example
• Operations: put(B2, Data1), put(S19, Data2), put(J10, Data3), put(N14, Data4), put(X24, Data5), put(W23, Data6), get(X24), get(W23)
• Open Addressing: When a collision occurs, probe for an empty slot. In this case, use linear probing (looking “down”) until an empty slot is found.

Slot    Contents
0       (N14, Data4)
1
2       (B2, Data1)
3       (J10, Data3)    (X24, Data5) ??? collides here
4       (X24, Data5)
5       (S19, Data2)
6       (W23, Data6)

Open Addressing • Uses a “probe sequence” to look for an empty slot to use • The first location examined is the “hash” address • The sequence of locations examined when locating data is called the “probe sequence” • The probe sequence {s(0), s(1), s(2), … } can be described as follows: s(i) = norm(h(K) + p(i)) • where h(K) is the “hash function” mapping K to an integer • p(i) is a “probing function” returning an offset for the ith probe • norm is the “normalizing function” (usually division modulo capacity)

Open Addressing
• Linear probing
  • use p(i) = i
  • The probe sequence becomes {norm(h(k)), norm(h(k)+1), norm(h(k)+2), …}
• Quadratic probing
  • use p(i) = i^2
  • The probe sequence becomes {norm(h(k)), norm(h(k)+1), norm(h(k)+4), …}
  • Must be careful to allow full coverage of “empty” array slots
  • A theorem states that this method will find an empty slot if the table size is prime and the table is not more than ½ full
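The probe sequence s(i) = norm(h(K) + p(i)) can be sketched with the probing function passed in as a parameter. The names `ProbeSequence` and `probes` are hypothetical, and norm is taken to be the remainder modulo the capacity.

```java
import java.util.Arrays;
import java.util.function.IntUnaryOperator;

class ProbeSequence {
    /** Returns the first `count` probes s(i) = (hash + p(i)) mod capacity. */
    static int[] probes(int hash, IntUnaryOperator p, int capacity, int count) {
        int[] s = new int[count];
        for (int i = 0; i < count; i++) {
            s[i] = Math.floorMod(hash + p.applyAsInt(i), capacity);
        }
        return s;
    }

    public static void main(String[] args) {
        // Hash address 5 in a table of capacity 7.
        System.out.println(Arrays.toString(probes(5, i -> i, 7, 4)));      // linear
        System.out.println(Arrays.toString(probes(5, i -> i * i, 7, 4)));  // quadratic
    }
}
```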

Linear Probing
• If the current location is used, try the next table location

    linear_probing_insert(K)
        if (table is full) error
        probe = h(K)
        while (table[probe] occupied)
            probe = (probe + 1) mod M
        table[probe] = K

• Lookups walk along the table until the key or an empty slot is found
• Uses less memory than chaining. (Don’t have to store all those links.)
• Slower than chaining. (May have to walk along the table for a long way.)
• Deletion is more complex. (Either mark the deleted slot or fill in the slot by shifting some elements down.)
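A runnable version of the linear probing pseudocode might look like the sketch below. It goes beyond the slide in a few labeled ways: it stores a value with each key, overwrites the value when the key is already present, and returns null from `get` for a missing key. The class name and the `indexOf` helper are hypothetical. Removal is left out because it needs the extra care noted in the last bullet.

```java
class LinearProbingTable {
    private final Integer[] keys;   // null marks an empty slot
    private final String[] values;

    LinearProbingTable(int capacity) {
        keys = new Integer[capacity];
        values = new String[capacity];
    }

    private int h(int key) {
        return Math.floorMod(key, keys.length);
    }

    void put(int key, String value) {
        for (int i = 0; i < keys.length; i++) {
            int probe = (h(key) + i) % keys.length;
            if (keys[probe] == null || keys[probe] == key) {
                keys[probe] = key;
                values[probe] = value;
                return;
            }
        }
        throw new IllegalStateException("table is full");
    }

    /** Slot holding the key, or -1 if the walk hits an empty slot first. */
    int indexOf(int key) {
        for (int i = 0; i < keys.length; i++) {
            int probe = (h(key) + i) % keys.length;
            if (keys[probe] == null) return -1;
            if (keys[probe] == key) return probe;
        }
        return -1;
    }

    String get(int key) {
        int at = indexOf(key);
        return at < 0 ? null : values[at];
    }

    public static void main(String[] args) {
        LinearProbingTable t = new LinearProbingTable(7);
        int[] ks = {2, 19, 10, 14, 24, 23};
        for (int i = 0; i < ks.length; i++) t.put(ks[i], "Data" + (i + 1));
        System.out.println("24 is in slot " + t.indexOf(24));
        System.out.println("23 is in slot " + t.indexOf(23));
    }
}
```

The demo keys are the alphabet positions from the earlier letter example (B = 2, S = 19, and so on) with capacity 7.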

Linear Probing Example • h(k) = k mod 13 • Insert keys: 31 73 44 32 41 18 44 59 32 22 31 73

Linear Probing Example (cont.)

Linear probing
[Figure: the hash function h maps a key to bin h(key) in a table with bins 0 to N-1; after a collision the next bin examined is (h(key) + 1) mod N.]

Quadratic probing
[Figure: the hash function h maps a key to bin h(key) in a table with bins 0 to N-1, where N = 17 (prime); successive probes examine (h(key) + 1) mod N and so on, up to (h(key) + 121) mod N and (h(key) + 144) mod N.]

Quadratic probing
• Theorem: If quadratic probing is used, and the table size is prime, then a new element can always be inserted if the table is at least half empty.
• Application (N = 17, prime): Probing visited only 9 of the 17 bins, but if the table is at least half empty, not all of those 9 bins can be occupied, so we must be able to insert a new element in one of them.

Collisions
• Given N people in a room, what are the odds that at least two of them will have the same birthday?
  • Table capacity of 365
  • After N insertions what are the odds of at least one collision?
• Who wants to be a Millionaire? Assume N = 23 (load factor is therefore 23/365 = 6.3%). What are the approximate odds that two of these people have the same birthday?
  • 10%, 25%, 50%, 75%, 90%, or 99%

Collisions
• Let Q(n) be the probability that when n people are in a room, no two of them share a birthday
• Let P(n) be the probability that when n people are in a room, at least two of them have the same birthday
• P(n) = 1 - Q(n)
• Consider that:
  • Q(1) = 1
  • Q(2) = Q(1) times the odds that one more person does not “collide” with the first
  • Q(2) = Q(1) * 364/365
  • Q(3) = Q(2) * 363/365
  • Q(4) = Q(3) * 362/365
  • …
  • $Q(n) = \frac{365}{365} \cdot \frac{364}{365} \cdot \frac{363}{365} \cdots \frac{365-n+1}{365}$
  • $Q(n) = \frac{365!}{365^n \, (365-n)!}$

Collisions

Number of people N    Odds of collision
5                     2.7%
10                    11.7%
15                    25.3%
23                    50.7%
30                    70.6%
40                    89.1%
45                    94.1%
100                   99.9999%

• Collisions are more frequent than you might expect, even for low load factors!
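The odds in the table can be recomputed directly from the product form of Q(n). This is a small sketch with hypothetical names; the capacity is a parameter so the same calculation applies to a hashtable with any number of slots, assuming every slot is equally likely.

```java
class BirthdayCollision {
    /** Q(n): probability that n uniformly random keys all land in distinct slots. */
    static double noCollision(int n, int capacity) {
        double q = 1.0;
        for (int i = 0; i < n; i++) {
            q *= (double) (capacity - i) / capacity;
        }
        return q;
    }

    /** P(n) = 1 - Q(n): probability of at least one collision. */
    static double collision(int n, int capacity) {
        return 1.0 - noCollision(n, capacity);
    }

    public static void main(String[] args) {
        for (int n : new int[] {5, 10, 15, 23, 30, 40, 45, 100}) {
            System.out.printf("%3d people: %.4f%%%n", n, 100 * collision(n, 365));
        }
    }
}
```

Multiplying the factors one at a time avoids evaluating 365! directly, which would overflow any built-in numeric type.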

Hashcodes and table size
• Hashcodes should be fast/easy to compute
• Keys should evenly distribute across the table
• Hashtable capacities are usually kept at prime values to avoid problems with probe sequences
• Consider inserting into the table below using quadratic probing and a key object that hashes to index 2
[Figure: an empty table with slots 0 to 15.]
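To explore the question above, the sketch below collects every slot that quadratic probing can reach from a given hash address. The names are hypothetical. Because (i + capacity)^2 and i^2 leave the same remainder modulo the capacity, trying i = 0 to capacity - 1 covers every probe that will ever occur.

```java
import java.util.TreeSet;

class QuadraticCoverage {
    /** All distinct slots (home + i*i) mod capacity reachable by probing. */
    static TreeSet<Integer> visitedSlots(int home, int capacity) {
        TreeSet<Integer> seen = new TreeSet<>();
        for (int i = 0; i < capacity; i++) {
            seen.add((home + i * i) % capacity);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("capacity 16: " + visitedSlots(2, 16));
        System.out.println("capacity 17: " + visitedSlots(2, 17));
    }
}
```

With capacity 16 the probes from index 2 only ever reach slots 2, 3, 6, and 11, so an insert can fail while most of the table is empty. With the prime capacity 17 they reach 9 distinct slots, which matches the count quoted for the N = 17 theorem example.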

We need to have a little talk
• How to remove an item from a hashtable that uses open addressing?
• Consider a table of size 11 with the following sequence of operations using h(K) = K % 11 and linear probing (p(i) = i)
  • put(36, D1)
  • put(23, D2)
  • put(4, D3)
  • put(46, D4)
  • put(1, D5)
  • remove(23)
  • remove(36)
  • get(1)

Removal • If an item is removed from the table, it could mess up gets on other items in the table. • Fix the problem by using a “tombstone” marker to indicate that while the item has been removed from the array slot, the slot should be considered “occupied” for purposes of later gets.
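The tombstone idea can be sketched on top of linear probing with a per-slot state. This is an illustration under stated assumptions, not the course's implementation: the class name is hypothetical, `get` and `remove` return null for a missing key instead of raising an error, and `put` reuses the first slot that is not occupied.

```java
class TombstoneTable {
    private static final int EMPTY = 0, FULL = 1, DELETED = 2;  // DELETED is the tombstone
    private final int[] keys;
    private final String[] values;
    private final int[] state;

    TombstoneTable(int capacity) {
        keys = new int[capacity];
        values = new String[capacity];
        state = new int[capacity];
    }

    /** Index of the key, or -1. Tombstones do not stop the search. */
    private int find(int key) {
        for (int i = 0; i < keys.length; i++) {
            int probe = (Math.floorMod(key, keys.length) + i) % keys.length;
            if (state[probe] == EMPTY) return -1;  // never used: key is absent
            if (state[probe] == FULL && keys[probe] == key) return probe;
        }
        return -1;
    }

    void put(int key, String value) {
        int at = find(key);
        if (at >= 0) { values[at] = value; return; }
        for (int i = 0; i < keys.length; i++) {
            int probe = (Math.floorMod(key, keys.length) + i) % keys.length;
            if (state[probe] != FULL) {  // empty slot or tombstone
                keys[probe] = key;
                values[probe] = value;
                state[probe] = FULL;
                return;
            }
        }
        throw new IllegalStateException("table is full");
    }

    String get(int key) {
        int at = find(key);
        return at < 0 ? null : values[at];
    }

    String remove(int key) {
        int at = find(key);
        if (at < 0) return null;
        String old = values[at];
        values[at] = null;
        state[at] = DELETED;  // leave a tombstone so later probes keep walking
        return old;
    }

    public static void main(String[] args) {
        TombstoneTable t = new TombstoneTable(11);
        t.put(36, "D1"); t.put(23, "D2"); t.put(4, "D3"); t.put(46, "D4"); t.put(1, "D5");
        t.remove(23);
        t.remove(36);
        System.out.println(t.get(1));
    }
}
```

In the sequence from the previous slide, key 1 hashes to slot 1 but is pushed to slot 5. Without tombstones, the slot emptied by remove(23) would end the search for key 1 immediately.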

Double Hashing • Another probing strategy is to use “double hashing” • The probe sequence becomes s(k,i) = norm(h(k) + i*h2(k)) • The hash value is determined by “two” hash functions and is typically better than linear or quadratic probing.

Double Hashing Example
• h1(K) = K mod 13
• h2(K) = 8 - (K mod 8)
• We want h2 to be an offset to add on each probe, so it must never be 0; here h2(K) always lies between 1 and 8
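The two hash functions of this example combine into a probe sequence as sketched below. The class name and the sample keys 44 and 18 are chosen here for illustration and are not taken from the slide; keys are assumed to be non-negative.

```java
class DoubleHashing {
    static final int CAPACITY = 13;

    static int h1(int k) {
        return k % 13;
    }

    /** Step size: always between 1 and 8, never 0. */
    static int h2(int k) {
        return 8 - k % 8;
    }

    /** i-th probe: s(k, i) = (h1(k) + i * h2(k)) mod CAPACITY. */
    static int probe(int k, int i) {
        return (h1(k) + i * h2(k)) % CAPACITY;
    }

    public static void main(String[] args) {
        for (int k : new int[] {44, 18}) {
            StringBuilder sb = new StringBuilder("key " + k + ":");
            for (int i = 0; i < 4; i++) sb.append(' ').append(probe(k, i));
            System.out.println(sb);
        }
    }
}
```

Keys 44 and 18 both start at slot 5, but their step sizes differ (4 and 6), so their probe sequences separate after the first collision. Linear and quadratic probing would send both keys through the same slots.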

Double Hashing Example (cont.)

Separate Chaining
• A way to “avoid” collisions: keys that map to the same slot simply share it
• Each array slot contains a list of data elements
• The fundamental methods then become:
  • PUT: hash into array and add to list
  • GET: hash into array and search the list
  • REMOVE: hash into array and remove from list
• Java’s built-in HashMap and Hashtable classes use separate chaining
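The three chained operations can be sketched as follows. The class names are hypothetical, and two behaviors are assumptions made for this sketch: a repeated key overwrites its old value (as `java.util.HashMap` does), and a missing key yields null rather than an error.

```java
import java.util.Iterator;
import java.util.LinkedList;

class ChainEntry<K, V> {
    final K key;
    V value;

    ChainEntry(K key, V value) {
        this.key = key;
        this.value = value;
    }
}

class ChainedTable<K, V> {
    private final LinkedList<ChainEntry<K, V>>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedTable(int capacity) {
        buckets = new LinkedList[capacity];
        for (int i = 0; i < capacity; i++) buckets[i] = new LinkedList<>();
    }

    /** Hash into the array: the list that owns this key. */
    private LinkedList<ChainEntry<K, V>> bucket(K key) {
        return buckets[Math.floorMod(key.hashCode(), buckets.length)];
    }

    void put(K key, V value) {
        for (ChainEntry<K, V> e : bucket(key)) {
            if (e.key.equals(key)) { e.value = value; return; }  // overwrite
        }
        bucket(key).add(new ChainEntry<>(key, value));
    }

    V get(K key) {
        for (ChainEntry<K, V> e : bucket(key)) {
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }

    V remove(K key) {
        Iterator<ChainEntry<K, V>> it = bucket(key).iterator();
        while (it.hasNext()) {
            ChainEntry<K, V> e = it.next();
            if (e.key.equals(key)) { it.remove(); return e.value; }
        }
        return null;
    }

    public static void main(String[] args) {
        ChainedTable<Integer, String> t = new ChainedTable<>(7);
        t.put(10, "Data3");
        t.put(24, "Data5");  // same bucket as 10, since 24 % 7 == 10 % 7 == 3
        System.out.println(t.get(24));
    }
}
```

Removal needs no tombstones here, because deleting an entry from one list cannot hide entries stored elsewhere.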

Chaining Example
• Operations: put(B2, Data1), put(S19, Data2), put(J10, Data3), put(N14, Data4), put(X24, Data5), put(W23, Data6), put(B2, Data7), get(X24), get(W23)

Slot    Contents
0       (N14, Data4)
1
2       (B2, Data1)
3       (J10, Data3)    (X24, Data5) ???
4
5       (S19, Data2)
6

Chaining Example
• Operations: put(B2, Data1), put(S19, Data2), put(J10, Data3), put(N14, Data4), put(X24, Data5), put(W23, Data6), put(B2, Data7), get(X24), get(W23)

Slot    Contents (list)
0       (N14, Data4)
1
2       (B2, Data1)
3       (X24, Data5), (J10, Data3)
4
5       (S19, Data2)
6

• I’m so relieved!

Theoretical Results
• Let λ = N/M
  • the load factor: average number of keys per array index
• Analysis is probabilistic, rather than worst-case
[Table: Expected Number of Probes, with columns “Not found” and “Found”.]

Expected Number of Probes vs. Load Factor

Summary
• Dictionaries may be ordered or unordered
• Unordered can be implemented with
  • lists (array-based or linked)
  • hashtables (best solution)
• Ordered can be implemented with
  • lists (array-based or linked)
  • trees (AVL (best solution), splay, BST)

HashTable CISC4080, Computer Algorithms CIS, Fordham Univ.
