Hashing (Walls & Mirrors - end of Chapter 12)
I hate quotations. Tell me what you know. – Ralph Waldo Emerson
Overview • Hashing • Data with Multiple Organizations
Hashing Basic idea: • Define a function that, given an item’s search key, determines the position in a table where the item should be stored. • No search-key comparisons are required. • Ideally, finding an item this way takes O( 1 ) time, which is even better than the O( log N ) time required by a minimum-height binary search tree!
Hashing: Definitions • Recall that a table is an Abstract Data Type (ADT) in which items are stored and retrieved according to their search-key values. • A hash function is a function that maps the search key of an item into the table location that will contain the item. • A hash table is an array that contains table items in the locations assigned by a hash function.
Hashing: Example • Suppose that flight information for an airline (e.g. origin, destination, departure time, arrival time, available seats, etc.) is to be stored in a table by flight number. • If the flight numbers are 3-digit numbers, ranging from 100 to 999, then one might simply store the information for flight k in position k of array a, namely, a[k]. • However, if flight information needs to be maintained in an air traffic control system for all airlines serving an airport, array a may become very large with many empty positions where no flight number has been assigned.
Hashing: Choosing a Hash Function • A solution to this problem is to provide a hash function, h, that maps the flight number of an airline into a valid position in a “reasonably-sized” array. (We shall discuss choosing an appropriate size for this array later.) • If array a is of size N, namely, a[0 .. N–1], then for flight k, a simple and effective choice of h is h( k ) = k mod N. • For example, if N = 1000, then flight 1234 would be stored in position 1234 mod 1000 of array a, namely a[234]. • Note that, for any k ≥ 0, 0 ≤ (k mod N) ≤ N–1. Therefore, this approach is guaranteed to produce a valid index for array a[0 .. N–1].
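The mod-based hash function above can be sketched in a few lines of Python. This is a minimal illustration, not code from the textbook; the names `h`, `N`, and `table` simply mirror the slide's h, N, and a.

```python
def h(k: int, n: int) -> int:
    """Map a non-negative search key k to an index in a[0 .. n-1]."""
    return k % n

N = 1000                    # table size from the slide's example
table = [None] * N          # plays the role of array a[0 .. N-1]

# Flight 1234 lands in position 1234 mod 1000 = 234.
table[h(1234, N)] = "flight 1234"
print(h(1234, N))           # 234
```

Because `k % n` is always between 0 and n-1 for k ≥ 0, the result can be used directly as a list index.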
Choosing a Hash Function (Cont’d.) • To be effective, a hash function must a) be fast to compute, and b) distribute items evenly throughout the hash table (array). • Various hash functions have been proposed, including • Selecting digits: h( 1234 ) = 23 (select middle two digits), • Folding: h( 1234 ) = 12 + 34 = 46. • Research shows that the hash functions that are best at achieving objective (b): • involve the entire search key, and • if h(k) = k mod N is used, choose N to be a prime number.
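The two example hash functions can be sketched as follows. The function names are hypothetical, and the digit-selection version assumes a 4-digit key as in the slide.

```python
def select_middle_digits(k: int) -> int:
    """Selecting digits: keep the middle two digits of a 4-digit key."""
    return (k // 10) % 100

def fold_pairs(k: int) -> int:
    """Folding: split the decimal digits into 2-digit groups and add them."""
    total = 0
    while k > 0:
        total += k % 100    # take the rightmost two digits
        k //= 100           # drop them and continue with the rest
    return total

print(select_middle_digits(1234))   # 23
print(fold_pairs(1234))             # 46  (12 + 34)
print(select_middle_digits(5239))   # 23, same as for 1234
```

The last line hints at why involving the entire search key matters: digit selection ignores the first and last digits, so 1234 and 5239 land in the same position, whereas folding uses every digit.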
Hashing a Character String • If the search key is an array of characters or string, it may be necessary to convert it into an integer before a hash function can be applied. • One way of doing this is to represent each character by its ASCII value, and then concatenate the results: S = 123 octal = 83 decimal; U = 125 octal = 85 decimal; E = 105 octal = 69 decimal. The integer corresponding to “SUE” would then be 123,125,105 octal = 21,801,541 decimal. • If we apply h(k) = k mod N to this, with N = 1000, we obtain array position 21,801,541 mod 1000 = 541.
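The concatenation of octal ASCII codes can be checked with a short sketch. It assumes the key contains only ASCII characters, so each code fits in three octal digits; `string_to_int` is a hypothetical helper name.

```python
def string_to_int(s: str) -> int:
    """Concatenate the 3-digit octal ASCII codes of the characters in s."""
    value = 0
    for ch in s:
        code = ord(ch)
        if code > 127:
            raise ValueError("this sketch assumes ASCII characters only")
        # Multiplying by 0o1000 shifts the value left by three octal digits.
        value = value * 0o1000 + code
    return value

key = string_to_int("SUE")
print(format(key, "o"))   # 123125105
print(key)                # 21801541
print(key % 1000)         # 541
```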
Hashing a Character String: Caution! The following must be considered: • Integer overflow. On a 32-bit computer, the largest int is 2^31 – 1, about 2.1 * 10^9. (The largest unsigned int is 2^32 – 1, about 4.3 * 10^9; long int is often implemented the same as int.) • Care must be taken to ensure that the numeric value determined for a string does not exceed the available space. • It may be useful to hash every 2-3 characters or employ folding as well as concatenation. • Loss of significant digits. In the preceding example, 21,801,541 mod 1000 = 541, the most important digits are 541; 21801 is, essentially, discarded by mod. Care must be taken to ensure that the rightmost digits are not dropped before the hash function is applied. (Otherwise, several strings could map to the same location.)
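One way to read the advice about hashing every 2-3 characters and folding is sketched below. Python integers do not overflow, so `INT_MAX` only models the limit of a 32-bit signed int; the helper names and the choice of 3-character groups are assumptions made for illustration.

```python
INT_MAX = 2**31 - 1   # largest 32-bit signed int, used here only for comparison

def concat_octal(s: str) -> int:
    """Concatenate 3-digit octal ASCII codes (assumes ASCII input)."""
    value = 0
    for ch in s:
        value = value * 0o1000 + ord(ch)
    return value

def fold_string(s: str, chunk: int = 3) -> int:
    """Concatenate within each group of `chunk` characters, then add the groups."""
    total = 0
    for i in range(0, len(s), chunk):
        total += concat_octal(s[i:i + chunk])
    return total

print(concat_octal("SUES") > INT_MAX)   # True: four characters already exceed the limit
print(fold_string("SUES"))              # 21801624  ("SUE" -> 21801541, plus "S" -> 83)
```

With 3-character groups of ASCII text, each group stays well below the 32-bit limit; the running sum could still grow too large for very long strings in a fixed-width language, so a real implementation would need to guard against that as well.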
Hashing: Resolving Collisions • Suppose that we use hash function h(k) = k mod N, with N = 1009 (which is prime). • Although this function will distribute items evenly throughout the hash table, note that 1234 mod 1009 = 225, 2243 mod 1009 = (225 + 2*1009) mod 1009 = 225, 3252 mod 1009 = (225 + 3*1009) mod 1009 = 225, and so on. • Consequently, we can still get multiple, distinct search keys mapping to the same table location. Such situations are called collisions. • Two general approaches for resolving collisions are considered.
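The collision in this example can be reproduced directly. The sketch below groups keys by the position h(k) = k mod N assigns them; `group_by_position` is a hypothetical helper, and key 1000 is added only as a contrast.

```python
from collections import defaultdict

def group_by_position(keys, n):
    """Return a dict mapping each table position to the keys that hash to it."""
    positions = defaultdict(list)
    for k in keys:
        positions[k % n].append(k)
    return dict(positions)

print(group_by_position([1234, 2243, 3252, 1000], 1009))
# {225: [1234, 2243, 3252], 1000: [1000]}
```

Any position that collects more than one key is a collision that the table must resolve.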
Resolving Collisions (Cont’d.) 1) Open Addressing. If the location indicated by the hash function, h(k), is occupied, search for another open (available) location: • Linear probing: If table location h(k) is occupied, consider locations h(k)+1, h(k)+2, h(k)+3, … until an available location is found. • Quadratic probing: If table position h(k) is occupied, consider h(k)+1², h(k)+2², h(k)+3², … = h(k)+1, h(k)+4, h(k)+9, … until an available location is found. • Double hashing: If table location h(k) is occupied, consider h(k)+g(k), h(k)+2*g(k), h(k)+3*g(k), … until an available location is found. g(k) is a second hash function. For table a[0 .. N–1], probe positions are taken mod N, so a probe that reaches position N wraps around to position 0.
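The three probe sequences can be sketched as Python generators. Each yields at most N candidate positions, reduced mod N so that they wrap around the end of the table. The second hash function g(k) = 7 - (k mod 7) is only a hypothetical choice for illustration; the slides do not specify one.

```python
from itertools import islice
from typing import Iterator

def linear_probe(home: int, n: int) -> Iterator[int]:
    """h(k), h(k)+1, h(k)+2, ... (mod n)."""
    for j in range(n):
        yield (home + j) % n

def quadratic_probe(home: int, n: int) -> Iterator[int]:
    """h(k), h(k)+1, h(k)+4, h(k)+9, ... (mod n)."""
    for j in range(n):
        yield (home + j * j) % n

def double_hash_probe(home: int, step: int, n: int) -> Iterator[int]:
    """h(k), h(k)+g(k), h(k)+2*g(k), ... (mod n), where step = g(k)."""
    for j in range(n):
        yield (home + j * step) % n

def g(k: int) -> int:
    """Hypothetical second hash function; never returns 0."""
    return 7 - (k % 7)

N, k = 1009, 1234
home = k % N                                               # 225
print(list(islice(linear_probe(home, N), 4)))              # [225, 226, 227, 228]
print(list(islice(quadratic_probe(home, N), 4)))           # [225, 226, 229, 234]
print(list(islice(double_hash_probe(home, g(k), N), 4)))   # [225, 230, 235, 240]
print(list(islice(linear_probe(1008, N), 3)))              # [1008, 0, 1]
```

The last line shows the wrap-around: the probe after position N-1 is position 0.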
Resolving Collisions (Cont’d.) 2) Restructuring the Hash Table. The structure of the hash table is changed to accommodate multiple items in the same location: • Bucketing: Each location in table a[0 .. N–1] is an array, called a bucket, that can store multiple items. • Separate Chaining: Each location in table a[0 .. N–1] is the head of a linked list.
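A minimal sketch of separate chaining follows, assuming integer keys and h(k) = k mod N. The class and method names are hypothetical, and new nodes are placed at the head of the chain for simplicity.

```python
class Node:
    """One entry in a chain: a key, its item, and a link to the next node."""
    def __init__(self, key, item, next_node=None):
        self.key = key
        self.item = item
        self.next_node = next_node

class ChainedHashTable:
    def __init__(self, n: int = 1009):
        self.n = n
        self.heads = [None] * n      # each location is the head of a linked list

    def insert(self, key: int, item) -> None:
        i = key % self.n
        node = self.heads[i]
        while node is not None:      # replace the item if the key is already present
            if node.key == key:
                node.item = item
                return
            node = node.next_node
        self.heads[i] = Node(key, item, self.heads[i])   # otherwise add at the head

    def retrieve(self, key: int):
        node = self.heads[key % self.n]
        while node is not None:      # walk the chain for this location
            if node.key == key:
                return node.item
            node = node.next_node
        return None

table = ChainedHashTable()
table.insert(1234, "flight 1234")
table.insert(2243, "flight 2243")    # collides with 1234 at position 225
print(table.retrieve(2243))          # flight 2243
```

Bucketing would look similar, with a fixed-size array at each location in place of the linked list.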
Resolving Collisions: Comparing Approaches • Linear probing ( h(k)+1, h(k)+2, … ) will often cause items to cluster in the hash table, resulting in additional collisions. • Quadratic probing ( h(k)+1², h(k)+2², … ) eliminates the kinds of clusters formed by linear probing, and can be effective if the hash table is sufficiently large. • Double hashing ( h(k)+g(k), h(k)+2*g(k), … ) can also be effective at eliminating clusters if g is carefully chosen and the hash table is sufficiently large. • Bucketing will perform well if the table and the buckets are sufficiently large, but it can be wasteful of space. • Separate Chaining is space efficient, since the linked lists are allocated dynamically, and effective at resolving collisions as long as the linked lists do not get too long.
Resolving Collisions: Choosing a Table Size • As a hash table fills, the chance of collision increases, and the efficiency of locating an item decreases. • Specifically, for a hash table of size N, let α = (number of items in the table) / N. • When α = 0.5 (the table is half full), the time to access an item is nearly the same for all methods discussed. • As the table fills and α approaches 1, separate chaining is the most efficient. (For separate chaining, α is the average length of the linked lists.) • For the open addressing methods, when computing α, deleted items should be considered as remaining in the table, since the positions they occupy will need to be visited (and skipped over) when probing for an item. • In any case, the size of a hash table should be chosen so that α ≤ 2/3.
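The point about deleted items can be sketched with a small linear-probing table. This is an illustration under stated assumptions, not the textbook's implementation: the class name, the `EMPTY` and `DELETED` markers, and the `load_factor` method (items in the table divided by N, counting deleted positions) are all hypothetical, and deleted positions are never reused.

```python
EMPTY = object()     # position never used
DELETED = object()   # position whose item was removed; still visited when probing

class LinearProbingTable:
    def __init__(self, n: int):
        self.n = n
        self.slots = [EMPTY] * n
        self.used = 0            # positions holding an item or a DELETED marker

    def _probe(self, key: int):
        home = key % self.n
        for j in range(self.n):
            yield (home + j) % self.n

    def insert(self, key: int, item) -> None:
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                self.slots[i] = (key, item)
                self.used += 1
                return
            if slot is not DELETED and slot[0] == key:
                self.slots[i] = (key, item)      # replace an existing item
                return
        raise RuntimeError("no empty position left")

    def retrieve(self, key: int):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:                    # an unused position ends the search
                return None
            if slot is not DELETED and slot[0] == key:
                return slot[1]
        return None

    def delete(self, key: int) -> bool:
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                return False
            if slot is not DELETED and slot[0] == key:
                self.slots[i] = DELETED          # leave a marker; used is unchanged
                return True
        return False

    def load_factor(self) -> float:
        return self.used / self.n

t = LinearProbingTable(7)
for key, item in [(10, "a"), (17, "b"), (24, "c")]:   # all three hash to position 3
    t.insert(key, item)
t.delete(17)
print(t.retrieve(24))            # c (found by skipping over the deleted position)
print(t.load_factor() <= 2 / 3)  # True: 3 of 7 positions still count as occupied
```

If the deleted position were reset to `EMPTY`, the search for key 24 would stop early and wrongly report that the item is missing, which is why deleted positions still count toward the ratio.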
Data with Multiple Organizations • Suppose that you are running a business where customer orders are placed via the Web. • You want to fill the customer orders in the order they were placed, i.e. first-come-first-served. • However, if a customer calls to check on the status of their order, you would like to be able to quickly look up their account information, given their name. • One solution would be to maintain two copies of the customer orders: one stored in a queue in FIFO order, and the other in a list, sorted by the customers’ last name. • An alternative approach is to store one copy of the customer orders, but allow them to be linked in two different ways.
Data with Multiple Organizations (Cont’d.) [Figure: customer orders for Smith, Chen, Miller, and Weiss, with headPtr pointing to the sorted list and backPtr pointing to the FIFO queue.]
Data with Multiple Organizations (Cont’d.) [Figure: customer orders for Chen, Miller, Smith, and Weiss, with headPtr pointing to the sorted list and backPtr pointing to the FIFO queue.] • By storing only one copy of the customer orders, we don’t have to worry about keeping multiple copies up to date or in sync.
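The idea of one copy linked in two ways can be sketched as a node with two "next" links. This is a simplified illustration: `head_ptr` and `back_ptr` mirror the figure's headPtr and backPtr, while `front_ptr`, the class names, and the arrival order of the example customers are assumptions. Removing a filled order is omitted to keep the sketch short.

```python
class OrderNode:
    """A single customer order that belongs to two linked structures at once."""
    def __init__(self, name: str, order: str):
        self.name = name
        self.order = order
        self.next_in_queue = None   # link for the FIFO queue
        self.next_sorted = None     # link for the list sorted by name

class OrderBook:
    def __init__(self):
        self.head_ptr = None        # first node of the sorted list
        self.front_ptr = None       # oldest order in the queue (assumed helper)
        self.back_ptr = None        # newest order in the queue

    def place(self, name: str, order: str) -> None:
        node = OrderNode(name, order)
        # 1) Append to the back of the FIFO queue.
        if self.back_ptr is None:
            self.front_ptr = node
        else:
            self.back_ptr.next_in_queue = node
        self.back_ptr = node
        # 2) Insert into the sorted list at the right position.
        if self.head_ptr is None or name < self.head_ptr.name:
            node.next_sorted = self.head_ptr
            self.head_ptr = node
        else:
            prev = self.head_ptr
            while prev.next_sorted is not None and prev.next_sorted.name <= name:
                prev = prev.next_sorted
            node.next_sorted = prev.next_sorted
            prev.next_sorted = node

    def queue_order(self) -> list:
        names, node = [], self.front_ptr
        while node is not None:
            names.append(node.name)
            node = node.next_in_queue
        return names

    def sorted_order(self) -> list:
        names, node = [], self.head_ptr
        while node is not None:
            names.append(node.name)
            node = node.next_sorted
        return names

book = OrderBook()
for customer in ["Smith", "Chen", "Miller", "Weiss"]:
    book.place(customer, "order for " + customer)
print(book.queue_order())    # ['Smith', 'Chen', 'Miller', 'Weiss']
print(book.sorted_order())   # ['Chen', 'Miller', 'Smith', 'Weiss']
```

Both traversals visit the same four node objects, so updating an order through either structure is immediately visible through the other.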