Hashing Basic Ideas, Horner’s Rule Implemented

B Smith: Fall06: 90 minutes. Horner’s algorithm and base 26 took ~30 minutes.
B Smith: See M. A. Weiss, “Data Structures and Algorithm Analysis in C”, for good examples and problems.
B Smith: Fall07: 1 hour covered at the board during the week before lecture, over Horner’s rule and hashing.
B Smith: See Standish, “Data Structures in Java”, for good coverage.
B Smith: Sp07: 2 hours altogether. Rate: 3. Some of Horner’s rule was worked at the board.
B Smith: This is pretty good but rates a 2. Needs more work.
B Smith: Sp06: Since Horner’s algorithm was covered in detail earlier, this semester’s coverage went into less depth. Hashing seemed to be easily grasped, based on questions and responses. Though another lecture on hashing follows this one, it is largely redundant. It is probably better to reuse a shorter version of this one, or to split the coverage into a lecture and a lab/discussion. As it stands, covering this entirely took from 12pm to 2pm with a 15-minute break.
Math 140. Slides adapted from: R Sedgewick, CS226, Princeton

Math 140: Data Structures and Algorithms
Hashing Basic Ideas, Horner’s Rule Implemented

Learning Objectives • Describe the relative efficiencies of the various collision resolution techniques • Describe a hash table’s load factor • Describe rehashing and why it is necessary • Use hashing to implement the ADT dictionary

Overview
• Dictionaries (aka Lookup Tables, Maps, Associative Arrays)
• Hashing Fundamentals
  • Horner’s rule
  • Distributing the modulus operator
• Collision Resolution
  • Linear Probing
  • Separate Chaining
  • Double Hashing

Dictionary ADT (aka Map, Lookup Table)
Basic Operations
V get(Object key)
V put/add(K key, V value)
V remove(Object key)
boolean containsKey(Object key)
boolean containsValue(Object value)
boolean isEmpty()
int size()
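The operations above closely mirror Java's standard `java.util.Map` interface. As a minimal sketch, the class below exercises them with `java.util.HashMap`; the class name `DictionaryDemo` and the choice of author-to-year entries (taken from the bibliography example in these slides) are illustrative assumptions, not part of the original deck.

```java
import java.util.HashMap;
import java.util.Map;

class DictionaryDemo {
    /** Builds a small author -> publication-year dictionary. */
    static Map<String, Integer> build() {
        Map<String, Integer> years = new HashMap<>();
        years.put("Kruse", 1991);       // put/add
        years.put("Sedgewick", 1990);
        years.put("Knuth", 1975);
        return years;
    }

    public static void main(String[] args) {
        Map<String, Integer> years = build();
        System.out.println(years.get("Kruse"));          // get
        System.out.println(years.containsKey("Aho"));    // containsKey
        System.out.println(years.containsValue(1975));   // containsValue
        years.remove("Knuth");                           // remove
        System.out.println(years.size() + " " + years.isEmpty());
    }
}
```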

Example: Bibliography
• R. Kruse, C. Tondo, B. Leung, “Data Structures and Program Design in C”, 1991, Prentice Hall.
• E. Horowitz, S. Sahni, S. Anderson-Freed, “Fundamentals of Data Structures in C”, 1993, Computer Science Press.
• R. Sedgewick, “Algorithms in C”, 1990, Addison-Wesley.
• A. Aho, J. Hopcroft, J. Ullman, “Data Structures and Algorithms”, 1983, Addison-Wesley.
• T.A. Standish, “Data Structures, Algorithms & Software Principles in C”, 1995, Addison-Wesley.
• D. Knuth, “The Art of Computer Programming”, 1975, Addison-Wesley.
• Y. Langsam, M. Augenstein, A. Tenenbaum, “Data Structures using C and C++”, 1996, Prentice Hall.

Hashing
[Figure: a key is passed to the hash function, which returns a position (pos) in the hash table. The hash table has slots 0, 1, 2, 3, ..., (TABLESIZE - 1).]

Example:
[Figure: the key “Kruse” is passed to the hash function, which returns 5, so Kruse is stored in slot 5 of a hash table with slots 0 to 6.]

Hashing • Each item has a unique key. • Use a large array called a Hash Table. • Use a Hash Function.

B Smith: redundant?
Hash Function
• Maps keys to positions in the Hash Table.
• Should . . .
  • Be easy to calculate.
  • Use all of the key.
  • Spread the keys uniformly.
• Modulus division is the most common hash method.
• Dividing by a prime number tends to result in fewer collisions (a better distribution of hash values).

Inserting into a Hash Table
• Task: store up to 10 key/value pairs of students in a class.
• The first prime number after 10 is 11, so the table size will be 11.
• The hash function will be key % 11.
• For non-negative integer keys, this always results in a number from 0 to 10.

Inserting into a Hash Table

Retrieving from a Hash Table
• Retrievals are performed the same way: use the key and the hash function to compute an index into the array.
• If the element at that index is not empty, return its value (e.g., a pointer).
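The insert and retrieve steps for the 11-slot student table might be sketched as follows. This is an illustration under stated assumptions: the class name `StudentTable`, the integer student IDs, and the names are hypothetical, keys are assumed non-negative, and collisions are deliberately ignored here (a colliding insert simply overwrites), since collision resolution comes later in the deck.

```java
class StudentTable {
    static final int TABLE_SIZE = 11;               // first prime after 10
    private final String[] names = new String[TABLE_SIZE];

    /** Hash function from the slide; assumes key >= 0. */
    static int hash(int key) {
        return key % TABLE_SIZE;
    }

    /** Stores the value at the hashed position (overwrites on collision). */
    void insert(int key, String name) {
        names[hash(key)] = name;
    }

    /** Returns the value at the hashed position, or null if the slot is empty. */
    String retrieve(int key) {
        return names[hash(key)];
    }

    public static void main(String[] args) {
        StudentTable table = new StudentTable();
        table.insert(25, "Ana");                    // hypothetical student ID 25
        System.out.println(hash(25) + " " + table.retrieve(25));
    }
}
```

Because the key itself is not stored, a retrieval with a different key that hashes to the same slot would return the wrong student; a real table stores the key alongside the value and compares it.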

B Smith: The description of the tradeoff is confusing
Hashing: basic plan
• Save each object in a table, at a location determined by its key.
• HASH FUNCTION: method for computing a table index from a key.
• COLLISION RESOLUTION STRATEGY: algorithm and data structure to handle two keys that hash to the same index.
• Classic time-space tradeoff
  • No space limitation: use a trivial hash function, with the key itself as the address.
  • No time limitation: use trivial collision resolution, i.e. sequential search.
  • Limitations on both time and space: hashing.

Hashing Words??
• Can you take the word “hero” and hash it into a number?
• Easy: think “base-26”
• Each character has an associated ASCII value: ‘h’ = 104, ‘e’ = 101, ‘r’ = 114, ‘o’ = 111
‘h’*26^3 + ‘e’*26^2 + ‘r’*26^1 + ‘o’*26^0
(int)'h' * Math.pow(26,3) + (int)'e' * Math.pow(26,2) + (int)'r' * Math.pow(26,1) + (int)'o' * Math.pow(26,0)
= 1,899,255
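A direct translation of the base-26 sum into integer arithmetic could look like the sketch below. The class and method names are illustrative assumptions; the inner loop recomputes each power of 26 from scratch, which is the wasteful pattern that Horner's rule later removes.

```java
class Base26Naive {
    /** Computes word[0]*26^(n-1) + ... + word[n-1]*26^0, one power at a time. */
    static long naiveValue(String word) {
        long total = 0;
        int n = word.length();
        for (int i = 0; i < n; i++) {
            long power = 1;
            for (int j = 0; j < n - 1 - i; j++) {
                power *= 26;                        // recompute 26^(n-1-i) each time
            }
            total += word.charAt(i) * power;        // charAt gives the character code
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(naiveValue("hero"));     // 1899255
    }
}
```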

Base 2, 8, 10, 16, and 256?
• Base 10 counting:
  • 342 = 3*10^2 + 4*10^1 + 2*10^0
• Base 2 counting:
  • 11011 (base 2) = 1*2^4 + 1*2^3 + 0*2^2 + 1*2^1 + 1*2^0 = 16 + 8 + 0 + 2 + 1 = 27
• Base 8 counting:
  • 716 (base 8) = 7*8^2 + 1*8^1 + 6*8^0 = 462
• Base 26 (alphabet) counting:
  • “cat” = ‘c’*26^2 + ‘a’*26^1 + ‘t’*26^0
• Base ASCII (extended alphabet) counting:
  • “}&t” = ‘}’*256^2 + ‘&’*256^1 + ‘t’*256^0
• but the numbers are HUGE!

Problems
• The numbers are HUGE values
  • Using the modulus operator will help
  • A prime table size will help spread the keys across the table
• Polynomial calculations need many multiplications
  • Horner’s Algorithm will help increase efficiency

Horner’s Rule
• An algorithm for efficient evaluation of a polynomial
• For a polynomial of degree n, we go from O(n^2) multiplications (computing each power from scratch) to O(n).
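As a minimal sketch of Horner's rule, the polynomial is rewritten in nested form, for example 3*10^2 + 4*10 + 2 = (3*10 + 4)*10 + 2, so each coefficient costs one multiplication and one addition. The class and method names below are illustrative assumptions.

```java
class Horner {
    /** Evaluates coeffs[0]*x^(n-1) + ... + coeffs[n-1], highest-degree coefficient first. */
    static long evaluate(int[] coeffs, long x) {
        long result = 0;
        for (int c : coeffs) {
            result = result * x + c;                // one multiply and one add per coefficient
        }
        return result;
    }

    /** Treats the characters of a word as the digits of a number in the given base. */
    static long wordValue(String word, int base) {
        long result = 0;
        for (int i = 0; i < word.length(); i++) {
            result = result * base + word.charAt(i);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(evaluate(new int[] {3, 4, 2}, 10));  // 342
        System.out.println(wordValue("hero", 26));              // 1899255
    }
}
```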

B Smith: currently hidden
Hash Functions (long keys)
hashing “abcd…”
abcd = 97*256^3 + 98*256^2 + 99*256^1 + 100*256^0
     = 256*(256*(256*97 + 98) + 99) + 100
BIG IDEA: Take mod after each multiplication vs at end! The modulus operator “distributes”: (a*h + c) % M gives the same result whether or not h has already been reduced mod M.
(256*97 + 98) % 101 = 24930 % 101 = 84
(256*84 + 99) % 101 = 21603 % 101 = 90
(256*90 + 100) % 101 = 23140 % 101 = 11
Scramble by replacing 256 with 117.
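The claim that taking the modulus after every step matches taking it once at the end can be checked directly for a short key. The sketch below uses hypothetical names; `modAtEnd` only works while the full value fits in a `long`, which is exactly why the step-by-step version is preferred for long keys.

```java
class ModDistribution {
    /** Builds the full base-`base` value first, then reduces. Overflows for long strings. */
    static long modAtEnd(String s, int base, int m) {
        long value = 0;
        for (int i = 0; i < s.length(); i++) {
            value = value * base + s.charAt(i);
        }
        return value % m;
    }

    /** Reduces after every Horner step, so intermediate values stay below base*m + 65536. */
    static int modEachStep(String s, int base, int m) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = (base * h + s.charAt(i)) % m;
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(modAtEnd("abcd", 256, 101));     // 11
        System.out.println(modEachStep("abcd", 256, 101));  // 11
    }
}
```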

B Smith: 75 minutes to this point, including the Horner discussion
Hash Functions (long keys) (based on Horner’s Method)
    static int hash(String s, int M) {
        int h = 0, a = 117;   // a and M should be relatively prime
        int n = s.length();
        for (int i = 0; i < n; i++)
            h = (a*h + s.charAt(i)) % M;   // Horner step, reduced mod M each time
        return h;
    }
• Equivalent to h = (117^(N-1)*v_(N-1) + . . . + 117^2*v_2 + 117*v_1 + v_0) % M, where N is the length of s and v_0 is its last character.

B Smith: Horner’s should be introduced here!
B Smith: 75 minutes to this point. Sp07
Example: Hash Function #1
    int hash(char[] s) {
        int i = 0;
        int value = 0;
        while (s[i] != '\0') {
            value = (s[i] + 31*value) % 101;
            i++;
        }
        return value;
    }

Example: Hash Function #1
value = (s[i] + 31*value) % 101;
Key: “Aho”, from A. Aho, J. Hopcroft, J. Ullman, “Data Structures and Algorithms”, 1983, Addison-Wesley.
‘A’ = 65, ‘h’ = 104, ‘o’ = 111
value = (65 + 31*0) % 101 = 65
value = (104 + 31*65) % 101 = 99
value = (111 + 31*99) % 101 = 49

Example: Hash Function #1
value = (s[i] + 31*value) % 101;
Key          Hash Value
Aho          49
Kruse        95
Standish     60
Horowitz     28
Langsam      21
Sedgewick    24
Knuth        44
The resulting table is “sparse” (7 keys in 101 slots).

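Example: Hash Function #2
value = (s[i] + 1024*value) % 128;
Key          Hash Value
Aho          111
Kruse        101
Standish     104
Horowitz     122
Langsam      109
Sedgewick    107
Knuth        104
Likely to result in “clustering”: 1024 is a multiple of 128, so the earlier characters drop out and the hash value is just the ASCII code of the last character.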

Example: Hash Function #3
value = (s[i] + 3*value) % 7;
Key          Hash Value
Aho          0
Kruse        5
Standish     1
Horowitz     5
Langsam      5
Sedgewick    2
Knuth        1
“Collisions”: Kruse, Horowitz, and Langsam all hash to 5; Standish and Knuth both hash to 1.
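The three example hash functions differ only in their multiplier and table size, so they can be compared with one parameterized method. The following is a sketch in Java rather than the C of the original example; the class name and parameter names are assumptions.

```java
class HashComparison {
    /** value = (s[i] + multiplier*value) % tableSize, applied left to right. */
    static int hash(String s, int multiplier, int tableSize) {
        int value = 0;
        for (int i = 0; i < s.length(); i++) {
            value = (s.charAt(i) + multiplier * value) % tableSize;
        }
        return value;
    }

    public static void main(String[] args) {
        String[] keys = {"Aho", "Kruse", "Standish", "Horowitz",
                         "Langsam", "Sedgewick", "Knuth"};
        for (String k : keys) {
            System.out.println(k + ": #1=" + hash(k, 31, 101)
                    + " #2=" + hash(k, 1024, 128)
                    + " #3=" + hash(k, 3, 7));
        }
    }
}
```

With multiplier 1024 and table size 128, the product is always a multiple of 128, so function #2 returns the code of the last character only (for instance 104 for both “Standish” and “Knuth”).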

Insert • Apply hash function to get a position. • Try to insert key at this position. • Deal with collision.

Example: Insert
Keys: Aho, Kruse, Standish, Horowitz, Langsam, Sedgewick, Knuth
[Figure: inserting Aho. The hash function gives position 0, so Aho is stored in slot 0.]
Hash table: 0: Aho, 1-6: (empty)

Example: Insert
Keys: Aho, Kruse, Standish, Horowitz, Langsam, Sedgewick, Knuth
[Figure: inserting Kruse. The hash function gives position 5, so Kruse is stored in slot 5.]
Hash table: 0: Aho, 1-4: (empty), 5: Kruse, 6: (empty)

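Example: Insert
Keys: Aho, Kruse, Standish, Horowitz, Langsam, Sedgewick, Knuth
[Figure: inserting Standish. The hash function gives position 1, so Standish is stored in slot 1.]
Hash table: 0: Aho, 1: Standish, 2-4: (empty), 5: Kruse, 6: (empty)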

Search • Apply hash function to get a position. • Look in that position. • Deal with collision.

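Example: Search
Keys: Aho, Kruse, Standish, Horowitz, Langsam, Sedgewick, Knuth
[Figure: searching for Kruse. The hash function gives position 5, and slot 5 holds Kruse: found.]
Hash table: 0: Aho, 1: Standish, 2-4: (empty), 5: Kruse, 6: (empty)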

Example: Search
Keys: Aho, Kruse, Standish, Horowitz, Langsam, Sedgewick, Knuth
[Figure: searching for Sedgewick. The hash function gives position 2, and slot 2 is empty: not found.]
Hash table: 0: Aho, 1: Standish, 2-4: (empty), 5: Kruse, 6: (empty)

Collision Resolution
• Two approaches (table size M, number of items N):
• Separate chaining
  • M much smaller than N
  • ~N/M keys per table position
  • put the keys that collide in a list
  • need to search lists
• Open addressing (linear probing, double hashing)
  • M much larger than N
  • plenty of empty table slots
  • when a new key collides, find an empty slot
  • complex collision patterns
α = N/M = load factor of the table

Separate Chaining
• Hash to an array of linked lists
• Hash: map key to a value between 0 and M-1
• Array: constant-time access to the list containing the key
• Linked lists: constant-time insert; search through the list using an elementary algorithm
Parameters
• M too large: too many empty array entries
• M too small: lists too long
• Typical choice: α = N/M ~ 10, giving constant-time search/insert
[Figure: example table with M = 11; * marks the end of a list]
0: *
1: L A A A *
2: M X *
3: N C *
4: *
5: E P E E *
6: *
7: G R *
8: H S *
9: I *
10: *

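Separate Chaining
B Smith: animation would help this a little
[Figure: inserting the keys below into a table of M = 5 linked lists; each new key goes at the front of its list.]
Key:   A S E R C H I N G X M P L
Hash:  0 2 0 4 4 4 2 2 1 2 4 3 3
Resulting lists:
0: E A
1: G
2: X N I S
3: L P
4: M H C R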
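A separate-chaining table along the lines described above might be sketched as below. The slides do not give the hash function that produced the values 0 2 0 4 4 4 2 2 1 2 4 3 3, so this sketch takes each key's hash value as an explicit argument; that parameter, like the class and method names, is an assumption made for illustration.

```java
class ChainedTable {
    /** Singly linked list node holding one key. */
    private static class Node {
        final char key;
        final Node next;
        Node(char key, Node next) { this.key = key; this.next = next; }
    }

    private final Node[] heads;                     // one list head per table position

    ChainedTable(int m) { heads = new Node[m]; }

    /** Constant-time insert at the front of list h. */
    void insert(char key, int h) {
        heads[h] = new Node(key, heads[h]);
    }

    /** Sequential search through list h. */
    boolean contains(char key, int h) {
        for (Node x = heads[h]; x != null; x = x.next) {
            if (x.key == key) return true;
        }
        return false;
    }

    /** Renders list h front to back, e.g. "XNIS". */
    String chain(int h) {
        StringBuilder sb = new StringBuilder();
        for (Node x = heads[h]; x != null; x = x.next) sb.append(x.key);
        return sb.toString();
    }

    public static void main(String[] args) {
        String keys = "ASERCHINGXMPL";
        int[] hashes = {0, 2, 0, 4, 4, 4, 2, 2, 1, 2, 4, 3, 3};
        ChainedTable t = new ChainedTable(5);
        for (int i = 0; i < keys.length(); i++) t.insert(keys.charAt(i), hashes[i]);
        for (int h = 0; h < 5; h++) System.out.println(h + ": " + t.chain(h));
    }
}
```

Inserting at the front keeps insertion constant-time, which is why the most recently inserted key appears first in each list.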

Linear Probing
[Figure: trace of inserting the keys below into a table of size M = 13 using linear probing.]
Key:   A  S  E  R  C  H  I   N  G   X   M  P
Hash:  7  3  9  9  8  4  11  7  10  12  0  8
Final table (slot: key):
0: G   1: X   2: M   3: S   4: H   5: P   6: (empty)   7: A   8: C   9: E   10: R   11: I   12: N

B Smith: graphics not visible
Linear Probing
• Linear probing: array of size M.
• Hash: map key to integer i between 0 and M-1.
• Insert: put in slot i if free; if not, try i+1, i+2, etc. (wrapping around to slot 0 after slot M-1).
• Search: search slot i; if occupied but no match, try i+1, i+2, etc.
• Cluster: contiguous block of items.
  • Search through a cluster using an elementary algorithm for arrays.
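The insert and search rules for linear probing can be sketched as a small Java class. This is a minimal illustration with no resizing or deletion; the class name, the pluggable hash function, and the `slideHash` helper (which simply looks up the hash values 7 3 9 9 8 4 11 7 10 12 0 8 6 used for the keys A S E R C H I N G X M P L in this deck's traces) are assumptions made for the example.

```java
import java.util.function.ToIntFunction;

/** Open-addressing table with linear probing: a sketch with no resizing or deletion. */
class LinearProbingTable {
    private final String[] keys;
    private final int[] values;
    private final ToIntFunction<String> hash;       // must return an index in 0..M-1

    LinearProbingTable(int m, ToIntFunction<String> hash) {
        this.keys = new String[m];
        this.values = new int[m];
        this.hash = hash;
    }

    /** Insert: put in slot i if free; if not, try i+1, i+2, ... (wrapping around). */
    void put(String key, int value) {
        int m = keys.length;
        int i = hash.applyAsInt(key);
        int probes = 0;
        while (keys[i] != null && !keys[i].equals(key)) {
            if (++probes == m) throw new IllegalStateException("table is full");
            i = (i + 1) % m;
        }
        keys[i] = key;
        values[i] = value;
    }

    /** Search: scan from slot i until the key or an empty slot is found. */
    Integer get(String key) {
        int m = keys.length;
        int i = hash.applyAsInt(key);
        for (int probes = 0; probes < m && keys[i] != null; probes++) {
            if (keys[i].equals(key)) return values[i];
            i = (i + 1) % m;
        }
        return null;                                // empty slot reached: not present
    }

    /** Renders single-character keys slot by slot, '.' for an empty slot. */
    String layout() {
        StringBuilder sb = new StringBuilder();
        for (String k : keys) sb.append(k == null ? "." : k);
        return sb.toString();
    }

    /** Hypothetical helper: looks up the hash values used in the slide traces. */
    static int slideHash(String key) {
        int[] hashes = {7, 3, 9, 9, 8, 4, 11, 7, 10, 12, 0, 8, 6};
        return hashes["ASERCHINGXMPL".indexOf(key)];
    }

    public static void main(String[] args) {
        LinearProbingTable t = new LinearProbingTable(13, LinearProbingTable::slideHash);
        for (char c : "ASERCHINGXMP".toCharArray()) t.put(String.valueOf(c), 1);
        System.out.println(t.layout());             // GXMSHP.ACERIN
    }
}
```

In this run N hashes to 7 but lands in slot 12, and P hashes to 8 but lands in slot 5 after wrapping around: both have to walk past an entire cluster, which is the cost of linear probing as the table fills.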

Expected Number of Comparisons (Probes)

Double Hashing
B Smith: Watch carefully how and where P gets inserted. Also note that the second hash function does not allow 0 to be an increment!
[Figure: trace of inserting the keys below into a table of size M = 13 using double hashing. Hash 1 gives the starting slot; Hash 2 gives the probe increment.]
Key:     A  S  E  R  C  H  I   N  G   X   M  P  L
Hash 1:  7  3  9  9  8  4  11  7  10  12  0  8  6
Hash 2:  1  3  1  5  5  5  3   3  2   3   5  4  2
Final table (slot: key):
0: M   1: R   2: X   3: S   4: H   5: L   6: P   7: A   8: C   9: E   10: N   11: I   12: G
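The double hashing trace can be replayed with a short sketch. The slides list the two hash values per key but not the functions that produced them, so the code below takes both value arrays as input; that interface, along with the class and method names, is an assumption for illustration.

```java
import java.util.Arrays;

class DoubleHashingTrace {
    /**
     * Inserts keys[k] starting at slot h1[k] and stepping by h2[k] on each collision.
     * Each step must be non-zero; with a prime table size m and 0 < step < m,
     * the probe sequence visits every slot.
     */
    static String insertAll(String keys, int[] h1, int[] h2, int m) {
        char[] table = new char[m];
        Arrays.fill(table, '.');                    // '.' marks an empty slot
        for (int k = 0; k < keys.length(); k++) {
            int i = h1[k];
            int step = h2[k];
            int probes = 0;
            while (table[i] != '.') {
                if (++probes == m) throw new IllegalStateException("no free slot found");
                i = (i + step) % m;                 // jump by the key's own increment
            }
            table[i] = keys.charAt(k);
        }
        return new String(table);
    }

    public static void main(String[] args) {
        int[] h1 = {7, 3, 9, 9, 8, 4, 11, 7, 10, 12, 0, 8, 6};
        int[] h2 = {1, 3, 1, 5, 5, 5, 3, 3, 2, 3, 5, 4, 2};
        System.out.println(insertAll("ASERCHINGXMPL", h1, h2, 13));  // MRXSHLPACENIG
    }
}
```

Following P through the loop: it starts at slot 8 and steps by 4 through slots 12, 3, 7, 11, and 2, all occupied, before landing in slot 6. A zero increment would retry the same slot forever, which is why the second hash function must never return 0.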

Double Hashing: Average Number of Probes
• When collisions are resolved with double hashing, the average number of probes required to search in a hash table of size M that contains N = αM keys is
  (1/α) ln(1/(1 − α)) and 1/(1 − α)
  for hits and misses, respectively.

Double Hashing

Hashing Tradeoffs • Separate chaining vs. linear probing/double hashing. • Space for links vs. empty table slots. • Small table + linked allocation vs. big coherent array. • Linear probing vs. double hashing. • table gives expected # probes for search hits and misses
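Expected probe counts for the two open-addressing methods can be tabulated from the classical analysis results (due to Knuth, and quoted in Sedgewick's texts). The sketch below assumes these standard formulas in terms of the load factor α = N/M: for linear probing, ½(1 + 1/(1 − α)) for hits and ½(1 + 1/(1 − α)²) for misses; for double hashing, (1/α) ln(1/(1 − α)) for hits and 1/(1 − α) for misses. The class and method names are illustrative.

```java
class ProbeCounts {
    static double linearHit(double a)  { return 0.5 * (1 + 1 / (1 - a)); }
    static double linearMiss(double a) { return 0.5 * (1 + 1 / ((1 - a) * (1 - a))); }
    static double doubleHit(double a)  { return Math.log(1 / (1 - a)) / a; }
    static double doubleMiss(double a) { return 1 / (1 - a); }

    public static void main(String[] args) {
        double[] loads = {0.5, 2.0 / 3, 0.75, 0.9};
        System.out.println("load   LP hit  LP miss  DH hit  DH miss");
        for (double a : loads) {
            System.out.printf("%.2f   %6.1f  %7.1f  %6.1f  %7.1f%n",
                    a, linearHit(a), linearMiss(a), doubleHit(a), doubleMiss(a));
        }
    }
}
```

The gap between the two methods is small at half full but grows quickly as the load factor approaches 1, which is the practical argument for double hashing when space is tight.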

B Smith: Reword this in own words
B Smith: review: constant time insert?
Reasons not to use hashing
• Hashing implements Dictionary ADT search and insert in constant time on average.
• Why use anything else?
  • no performance guarantee: the worst case degrades to a linear search
  • too much arithmetic on long keys
  • takes extra space
  • doesn't support all ADT operations efficiently (for example, visiting keys in sorted order)
  • the compare abstraction also works for partial orders (searching without keys)

Summary • Dictionary = Lookup Table = Map • Hashing basics • Collision Resolution • Separate Chaining • Linear Probing • Double Hashing
