CS503: Thirteenth Lecture, Fall 2008 Hash Tables

Michael Barnathan

Here’s what we’ll be learning: • Theory: • Keys and values. • What constitutes a good hash function? • Data Structures: • Hash Tables. • Collision Resolution: • Chaining • Open Addressing / Linear Probing • Perfect Hashing • Cuckoo Hashing

Review: Arrays and Random Access • Let’s review arrays for a moment: • A size n array is indexed by a contiguous set of integers from 0 to n-1. • Because the array is contiguous in memory, accessing any element of it can be performed in constant-time. This is random access. • If the index actually represents something about the dataset, we can use this to access desired elements in constant-time. • For example, asking “who is the 4th person up to bat?” in a baseball roster. • Answer: roster[3] (remember, they start at 0). • This is an O(1) operation. I’m fourth! (Worst team ever.)

Keys and Values • An index is an example of a numeric key into the array. • A key is an attribute or combination of attributes by which each record is identified. • Arr[3] identifies the fourth element in the array. In this case, the key is simply an element’s position in the array. • But we can also identify records by attributes such as employee names and salaries. • These don’t map too well to array indices. • The value of an element is the data accessed by the key. • For example, if Arr[3] is an Employee, “3” is the key and the resulting Employee object is the value. • A container that maps directly between keys and values is called a Map (surprise!) or an associative array.

Arrays’ Shortcomings • Arrays work well if keys are contiguous integers. • Years in a calendar, for example. • However, what if we have a non-numeric key? • In every data structure we’ve discussed so far, we have no choice but to search for it, which is an Ω(log n) operation. [Figure: searching for “John” among Bob, Alice, Eve, Charlie, John, Trudy, and Mallory.]

Mapping Data • Idea: What if we could map the word “John” to an array index somehow? • “John” -> 5. Arr[5] = … • Then finding “John” becomes equivalent to mapping “John” to 5 and accessing Arr[5]. • Arrays are random-access, so this is O(1). • Obvious question: How do we turn “John” into 5? Why 5 and not 6? • Less obvious question: What if “Bob” also maps to 5? What happens then?

Maps and Mathematical Functions • Go waaay back and think about the first time you heard the word “function”. • It was something that took input and transformed it into output. [Figure: f(x) = 2x maps 4 → 8, 3 → 6, 2 → 4.]

Maps and Mathematical Functions • So if we can do that, why not this? [Figure: a black-box function h(x) maps John → 2, Bob → 1, Alice → 0.]

Hashing: The Idea • We call the process of transforming input with a function and using the result as an index hashing. • This allows us to use strings or other objects as keys. • Example with double[] Salaries: Salaries[“Alice”] = 50000, Salaries[“Bob”] = 25000, Salaries[“John”] = 75000, where h(x) maps John → 2, Bob → 1, Alice → 0.

The Hash Function h(x) • We call h(x) a hash function. • Any function that maps the input type to something suitable for indexing may be used. • In Java, this means we are mapping from Object to int. • In fact, every Java class has a built-in method called: int hashCode() • This method is defined in the Object class, which means every object inherits a default implementation. • It also means you can override it in your own classes (and you should whenever you override equals()).
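As a minimal sketch of overriding hashCode(), consider the class below. The Employee class, its fields, and the indexFor helper are hypothetical names chosen for illustration; they are not part of the lecture. The helper shows one common way to turn an arbitrary int hash code into a valid array index.

```java
import java.util.Objects;

// Hypothetical record type, used only for illustration.
final class Employee {
    private final String name;
    private final int id;

    Employee(String name, int id) {
        this.name = name;
        this.id = id;
    }

    // Equal objects must produce equal hash codes,
    // so equals() and hashCode() are built from the same fields.
    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Employee)) return false;
        Employee e = (Employee) other;
        return id == e.id && name.equals(e.name);
    }

    @Override
    public int hashCode() {
        return Objects.hash(name, id);
    }

    // Reduce any object's hash code to an index in a table of the given size.
    // Math.floorMod keeps the result non-negative even when hashCode() is negative.
    static int indexFor(Object key, int tableSize) {
        return Math.floorMod(key.hashCode(), tableSize);
    }
}
```

Using floorMod rather than % matters because hashCode() may return a negative number, and a negative index would throw an ArrayIndexOutOfBoundsException.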

Good Hash Functions • A hash function must be deterministic: it must always return the same value for the same input. • Good hash functions distribute their output as uniformly as possible to minimize the number of “collisions”: pairs of different input values that hash to the same output. • If every distinct input value is mapped to a distinct output value, the function is called injective, or one-to-one. This is the ideal. • If the space of possible inputs is larger than the space of possible outputs, an injective function is impossible (due to the pigeonhole principle: if you put n+1 objects in n holes, at least one hole must have more than one object in it). • Because the hash function is computed on every access of the hash table, good hash functions must also execute very quickly.

The Birthday Paradox • If the range of possible inputs is larger than the range of possible outputs, it is impossible to obtain an ideal hash function due to the pigeonhole principle (know this principle). • However, even if this is not the case, it is still unlikely that a uniform hash function will avoid collisions. • This is due to the birthday paradox: • This just refers to the counterintuitive notion that it is highly likely that two people in a relatively small group share the same birthday. • Assuming a uniform distribution over 365 days: • In a group of 23 people, the probability that at least two share a birthday is about 50%. • In a group of 50 people, the probability is about 97%. • The probability does not reach exactly 100% until 366 people are in the room (pigeonhole principle again). • “Having the same birthday” -> “Hashing to the same value”.
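The figures on this slide can be checked with a short calculation. The probability that all k people have distinct birthdays is (365/365) · (364/365) · … · ((365 − k + 1)/365), and the collision probability is one minus that product. The sketch below assumes uniformly distributed birthdays; the class and method names are made up for this example.

```java
class BirthdayParadox {
    // Probability that at least two of `people` share one of `days`
    // equally likely birthdays (or hash values).
    static double collisionProbability(int people, int days) {
        if (people > days) {
            return 1.0; // pigeonhole principle: a collision is certain
        }
        double allDistinct = 1.0;
        for (int i = 0; i < people; i++) {
            // The (i+1)-th person must avoid the i birthdays already taken.
            allDistinct *= (double) (days - i) / days;
        }
        return 1.0 - allDistinct;
    }
}
```

Replacing "days" with the number of slots in a hash table and "people" with the number of keys gives the chance of at least one collision under an ideal uniform hash function.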

The Birthday Paradox (Wikipedia)

Popular Hash Functions • MD4 • MD5 • SHA-1 • SHA-2 • SHA-3 • CRC32 (a checksum, not a cryptographic hash) • Tiger • (Aside: Many hash functions are used for cryptography as well. If you use one to store passwords, combine each password with an extra random string, called a salt, before hashing, to defeat “rainbow table” attacks.)

Hash Tables • The hash table is the array that the hash function provides an index into. • Like other arrays, it begins with a fixed capacity, and a strategy is needed to grow it as elements are added. • Because performance degrades as the hash table begins to fill, the table is usually enlarged when the fraction of occupied slots passes a chosen threshold, called the load factor. • For example, a table with a load factor of 0.75 would increase in size when it is 75% full. • 0.75 is the default in Java’s Hashtable, HashSet, and HashMap classes. • Collisions, mappings of distinct objects to the same position in the array, must also be handled. • They become more of a problem as the hash table fills.
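The load-factor rule amounts to a one-line check. The sketch below is illustrative only: the names are hypothetical, and doubling the capacity is an assumed growth policy, not something the slide specifies.

```java
class LoadFactorDemo {
    // True when the fraction of occupied slots exceeds the chosen load factor.
    static boolean shouldGrow(int storedEntries, int capacity, double maxLoadFactor) {
        return (double) storedEntries / capacity > maxLoadFactor;
    }

    // Doubling is one common growth policy (an assumption in this sketch).
    static int grownCapacity(int capacity) {
        return capacity * 2;
    }
}
```

With 16 slots and a load factor of 0.75, twelve entries still fit (12/16 = 0.75), and the thirteenth triggers growth.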

Collision Resolution • What if element B hashes to a location already filled by element A? • We have a collision. • There are two basic strategies for handling this scenario: • Linear Probing (a form of open addressing). • Chaining. • Or, to put it in intuitive terms: • This spot’s taken. Store the new element somewhere else. • Cram both elements into the same spot.

Linear Probing • Let element B hash to the location h(B). • Suppose h(B) is already filled by element A. • A linear probing strategy simply stores B in the next available space. • If h(B) + 1 is available, this is where it is stored. • If not, we move to h(B) + 2 and check whether it is available. • And so on. • If we hit the end of the table, we wrap around to the beginning (modular arithmetic). • It is also possible to use an arbitrary offset k. • Then we check h(B) + k, h(B) + 2k, etc. • Again, every position is taken mod n, where n is the size of the table, so we wrap. • For the probe sequence to reach every slot, k and n must share no common factor. • The same strategy is used for access: • If the element at the hashed position is not the one we’re looking up, move down the hash table and check the next element. Repeat until the elements match or an empty space is reached.
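The insertion and lookup rules above can be sketched as a small fixed-capacity table. This is a minimal illustration, not a production design: the class name and methods are hypothetical, keys are Strings, values are ints, the offset is fixed at 1, the capacity is assumed to be at least 1, and deletion and resizing are left out.

```java
class LinearProbingTable {
    private final String[] keys;
    private final int[] values;
    private int size;

    LinearProbingTable(int capacity) {
        keys = new String[capacity];
        values = new int[capacity];
    }

    // The slot the hash function assigns to this key, before any probing.
    private int home(String key) {
        return Math.floorMod(key.hashCode(), keys.length);
    }

    // Inserts or overwrites. Returns false if the table is full and the key is absent.
    boolean put(String key, int value) {
        int n = keys.length;
        int start = home(key);
        for (int step = 0; step < n; step++) {
            int i = (start + step) % n;      // wrap around at the end of the table
            if (keys[i] == null) {           // empty slot: store the new element here
                keys[i] = key;
                values[i] = value;
                size++;
                return true;
            }
            if (keys[i].equals(key)) {       // key already present: overwrite
                values[i] = value;
                return true;
            }
        }
        return false;
    }

    // Returns the stored value, or null if the key is not in the table.
    Integer get(String key) {
        int n = keys.length;
        int start = home(key);
        for (int step = 0; step < n; step++) {
            int i = (start + step) % n;
            if (keys[i] == null) return null;          // empty slot ends the search
            if (keys[i].equals(key)) return values[i];
        }
        return null;
    }

    int size() {
        return size;
    }
}
```

Deletion is omitted on purpose: simply emptying a slot would cut off the probe sequence for any element stored after it, so real implementations mark deleted slots or re-insert the elements that follow.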

Linear Probing Example: Insert “Mallory” • Suppose h(x) sends Mallory to John’s spot. • We check the next spot. It’s filled. • We keep moving until we find an empty spot, and store Mallory there.

Linear Probing: Advantages and Disadvantages • Advantages: • Very space-efficient; values are stored in the hash table itself. • Simple; no extra structures needed. • Works fairly well when load factor is low. • However, a low load factor wastes space. • Because colliding elements remain adjacent in memory, caching behavior is exceptional. • Disadvantages: • Performance swiftly degrades when load factor exceeds 0.8. • Collisions may cluster, and this requires traversing the hash table one element at a time to find the next available space. This may slow insertion.

Chaining • Let element B hash to the location h(B). • Suppose h(B) is already filled by element A. • A chaining strategy stores a linked list at each slot of the table and appends the new element to the list at h(B). • When we wish to access the element again, we perform a linear search on that list.
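A minimal sketch of chaining follows, assuming String keys and int values. The class and its methods are hypothetical names for illustration. Each slot holds a linked list, and colliding keys share a list.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class ChainedTable {
    private static final class Entry {
        final String key;
        int value;

        Entry(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    // One linked list (chain) per slot of the table.
    private final List<LinkedList<Entry>> slots = new ArrayList<LinkedList<Entry>>();
    private int size;

    ChainedTable(int slotCount) {
        for (int i = 0; i < slotCount; i++) {
            slots.add(new LinkedList<Entry>());
        }
    }

    // The chain stored at the slot this key hashes to.
    private LinkedList<Entry> chainFor(String key) {
        return slots.get(Math.floorMod(key.hashCode(), slots.size()));
    }

    void put(String key, int value) {
        LinkedList<Entry> chain = chainFor(key);
        for (Entry e : chain) {              // scan first so a key is never stored twice
            if (e.key.equals(key)) {
                e.value = value;
                return;
            }
        }
        chain.addLast(new Entry(key, value)); // the append itself takes constant time
        size++;
    }

    // Linear search within the chain; returns null if the key is absent.
    Integer get(String key) {
        for (Entry e : chainFor(key)) {
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }

    int size() {
        return size;
    }
}
```

Because elements live in the lists rather than in the array itself, the table can hold more elements than it has slots; even a one-slot table works, although every lookup then degenerates into a linear search.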

Chaining Example: Insert “Mallory” • Suppose h(x) sends Mallory to John’s spot. • We then append Mallory to the linked list in that same spot.

Chaining: Advantages and Disadvantages • Advantages: • Intuitive; the location we hash at is always the one returned by the hash function. • New elements can be added to the list in constant-time; linear probing requires a linear scan. • Performance degrades only linearly as the table fills. • More elements may be stored in the table than there are available slots using this method. • You can quickly discover how many keys collide at a given slot (the length of its list). • Disadvantages: • Storing the data in adjacent memory locations, as in linear probing, has very good caching behavior. Linked lists in general do not.

Performance (Wikipedia)

Perfect Hashing • If all n keys are known prior to hashing, it is possible to construct a function that maps these keys to a hash table of size n without collisions. • This function is known as a perfect hash function. • There is a generalized procedure for discovering perfect hash functions described at http://cmph.sourceforge.net/papers/chm92.pdf. • But since this is a difficult paper to understand, just be aware that it is possible.
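To see that such a function can exist for a small, known key set, one can simply search for it by brute force. The sketch below is not the procedure from the paper linked above; it is a much cruder technique that tries seeds of a hypothetical seeded string hash until all keys land in distinct slots. It is only practical for a handful of distinct keys.

```java
class PerfectHashSearch {
    // A simple seeded string hash, reduced to a slot index (illustrative only).
    static int hash(String key, int seed, int tableSize) {
        int h = seed;
        for (int i = 0; i < key.length(); i++) {
            h = 31 * h + key.charAt(i);
        }
        h ^= (h >>> 16);                     // fold the high bits into the low bits
        return Math.floorMod(h, tableSize);
    }

    // Returns a seed for which hash() is collision-free on `keys`
    // (with table size keys.length), or -1 if none is found up to maxSeed.
    static int findSeed(String[] keys, int maxSeed) {
        int n = keys.length;
        for (int seed = 0; seed <= maxSeed; seed++) {
            boolean[] used = new boolean[n];
            boolean collisionFree = true;
            for (String key : keys) {
                int slot = hash(key, seed, n);
                if (used[slot]) {
                    collisionFree = false;
                    break;
                }
                used[slot] = true;
            }
            if (collisionFree) return seed;
        }
        return -1;
    }
}
```

For n keys and n slots, a random function is collision-free with probability n!/nⁿ, which shrinks quickly as n grows; this is why dedicated constructions are needed for large key sets.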

Cuckoo Hashing • This is a strategy that uses two hash functions to insert. • If a collision occurs using the first hash function, the existing element is pushed out of its space (replaced by the new element) and rehashed using the second function. • This can potentially push another element out. If a loop occurs, the hash table is rebuilt using a different set of hash functions. • However, a collision on both hash functions is unlikely until the table begins to fill. • Trouble begins at a lower load factor than in the other two strategies: • Using two hash functions, an appropriate load factor is .5. • However, using three, the appropriate load factor jumps to .91. • This strategy was generally found superior to both chaining and probing. However, it is still not widely known. • Fortunately for you, I have some very esoteric areas of interest.
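The eviction process can be sketched as a small set of Strings backed by two arrays, one per hash function. Everything here is an assumption made for illustration: the class name, the seeded hash, the bound on displacements used to detect a loop, the fixed random seed, and the decision to double the arrays on every rebuild (the slide only requires new hash functions). The initial size is assumed to be at least 1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class CuckooSet {
    private String[] first;   // indexed by the first hash function
    private String[] second;  // indexed by the second hash function
    private int seed1, seed2;
    private int size;
    private final Random random = new Random(42); // fixed seed for reproducibility

    CuckooSet(int slotsPerTable) {
        first = new String[slotsPerTable];
        second = new String[slotsPerTable];
        pickNewHashFunctions();
    }

    private void pickNewHashFunctions() {
        seed1 = random.nextInt();
        seed2 = random.nextInt();
    }

    // Illustrative seeded string hash; the seed selects the multiplier.
    private static int hash(String key, int seed, int tableSize) {
        int multiplier = seed | 1;
        int h = seed;
        for (int i = 0; i < key.length(); i++) h = h * multiplier + key.charAt(i);
        h ^= (h >>> 16);
        return Math.floorMod(h, tableSize);
    }

    // A key can only ever be in one of two places, so lookup is two probes.
    boolean contains(String key) {
        return key.equals(first[hash(key, seed1, first.length)])
            || key.equals(second[hash(key, seed2, second.length)]);
    }

    int size() {
        return size;
    }

    void add(String key) {
        if (contains(key)) return;
        String homeless = key;
        int maxKicks = first.length + second.length; // beyond this, assume a loop
        for (int kick = 0; kick < maxKicks; kick++) {
            int i = hash(homeless, seed1, first.length);
            String evicted = first[i];       // push out whoever lives there
            first[i] = homeless;
            if (evicted == null) { size++; return; }
            int j = hash(evicted, seed2, second.length);
            homeless = second[j];            // the evicted key may evict another
            second[j] = evicted;
            if (homeless == null) { size++; return; }
        }
        rebuild(homeless);
    }

    // Re-insert everything using new hash functions (and, in this sketch, larger arrays).
    private void rebuild(String pending) {
        List<String> all = new ArrayList<String>();
        all.add(pending);
        for (String k : first) if (k != null) all.add(k);
        for (String k : second) if (k != null) all.add(k);
        first = new String[first.length * 2];
        second = new String[second.length * 2];
        size = 0;
        pickNewHashFunctions();
        for (String k : all) add(k);
    }
}
```

The payoff is in contains(): no matter how many evictions an insertion caused, a lookup inspects at most two slots, so access is constant-time in the worst case.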

Unsorted Associative Containers • Java has excellent built-in support for hashing. • In particular, the unsorted associative containers utilize hash tables: • HashMap, which you have used: • Offers methods similar to TreeMap’s. • Usually faster for random-access queries. • As you saw in Assignment 3, performing range queries or sequential access is a pain (you had to sort). • HashSet. • Hashtable (which is very much like HashMap, but synchronized). • Why are they unsorted? • The point of a hash function is to turn keys into integers. In general, sorted order cannot be maintained through this conversion.
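A short sketch of the trade-off described above, reusing the salary figures from the earlier slide. The SalaryDemo class and its method names are hypothetical. HashMap gives fast lookup by key but no useful ordering, so sorted access means copying the entries into a sorted structure such as TreeMap.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SalaryDemo {
    // Builds an unsorted, hash-based map from employee name to salary.
    static Map<String, Double> salaries() {
        Map<String, Double> salaries = new HashMap<String, Double>();
        salaries.put("John", 75000.0);
        salaries.put("Alice", 50000.0);
        salaries.put("Bob", 25000.0);
        return salaries;
    }

    // HashMap iteration order is unspecified, so to list names alphabetically
    // we copy the entries into a TreeMap, which keeps its keys sorted.
    static List<String> namesInOrder(Map<String, Double> salaries) {
        return new ArrayList<String>(new TreeMap<String, Double>(salaries).keySet());
    }
}
```

Calling salaries().get("Alice") is an average constant-time lookup, while namesInOrder pays O(n log n) to recover the ordering that the hash table discarded.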

Hashing in Other Languages • Java: HashMap • C++: hash_map (std::unordered_map since C++11) • C#: Hashtable (or Dictionary<TKey, TValue>) • Perl: $var{'key'} = "value" • PHP: $var['key'] = "value" • Ruby: v = { 'key' => 'value' }

Performance • What is the complexity of insertion in a hash table if there are no collisions? • What if there are collisions? • If you choose your table size appropriately, collisions are rather rare. The average size of your chains usually ends up around 2 or 3. • Do hash tables need to use any extra space?

CRUD: Hash Tables • Insertion (average): O(1). • Access (average): O(1). • Deletion (average): O(1). • Insertion (worst): O(n). • Access (worst): O(n). • Deletion (worst): O(n). • Since collisions are not very common with a good hash function and an appropriate load factor, hash tables very often yield constant-time insertion, access, and deletion. • The amount of space used depends on the load factor, but remains O(n). • They are incredibly useful structures! • They allow you to index data by a generalized key rather than a numeric ID, and are therefore used extensively in databases and distributed queries. MapReduce, the programming model introduced by Google for large-scale data processing, partitions its intermediate keys by hashing.

Access on Demand • This was our discussion of hashing. • Next time, we will discuss amortized analysis and Java’s “Set” classes. • The lesson: • An event that is unlikely in any single trial becomes very likely given enough trials (birthday paradox).
