
# CS503: Thirteenth Lecture, Fall 2008 Hash Tables - PowerPoint PPT Presentation



### CS503: Thirteenth Lecture, Fall 2008 Hash Tables

Michael Barnathan

• Theory:

• Keys and values.

• What constitutes a good hash function?

• Data Structures:

• Hash Tables.

• Collision Resolution:

• Chaining

• Open Addressing / Linear Probing

• Perfect Hashing

• Cuckoo Hashing

• Let’s review arrays for a moment:

• A size n array is indexed by a contiguous set of integers from 0 to n-1.

• Because the array is contiguous in memory, accessing any element of it can be performed in constant time. This is random access.

• If the index actually represents something about the dataset, we can use this to access desired elements in constant time.

• For example, asking “who is the 4th person up to bat?” in a baseball roster.

• Answer: roster[3] (remember, they start at 0).

• This is an O(1) operation.

I’m fourth!

(Worst team ever.)

• An index is an example of a numeric key into the array.

• A key is an attribute or combination of attributes by which each record is identified.

• Arr[3] identifies the fourth element in the array. In this case, the key is simply an element’s position in the array.

• But we can also identify records by attributes such as employee names and salaries.

• These don’t map too well to array indices.

• The value of an element is the data accessed by the key.

• For example, if Arr[3] was an Employee, “3” is the key and the resulting Employee object is the value.

• A container that maps directly between keys and values is called a Map (surprise!) or an associative array.

• Arrays work well if keys are contiguous integers.

• Years in a calendar, for example.

• However, what if we have a non-numeric key?

• In every data structure we’ve discussed so far, we have no choice but to search for it, which is an Ω(log n) operation.

[Figure: someone calls “John? John?” into a crowd of Bob, Alice, Eve, Charlie, John, Trudy, and Mallory, who must be checked one at a time before John answers “I’m over here!”]

• Idea: What if we could map the word “John” to an array index somehow?

• “John” -> 5. Arr[5] = …

• Then finding “John” becomes equivalent to mapping “John” to 5 and accessing Arr[5].

• Arrays are random-access, so this is O(1).

• Obvious question: How do we turn “John” into 5? Why 5 and not 6?

• Less obvious question: What if “Bob” also maps to 5? What happens then?

• Go waaay back and think about the first time you heard the word “function”.

• It was something that took input and transformed it into output.

[Figure: f(x) = 2x maps the inputs 4, 3, 2 to the outputs 8, 6, 4.]

• So if we can do that, why not this?

[Figure: h(x), drawn as a black box, maps John → 2, Bob → 1, Alice → 0.]

• We call the process of transforming input with a function and using the result as an index hashing.

• This allows us to use strings or other objects as keys.

[Figure: a double[] Salaries array indexed through h(x), which maps John → 2, Bob → 1, Alice → 0, so that Salaries[“Alice”] = 50000, Salaries[“Bob”] = 25000, and Salaries[“John”] = 75000.]

The Hash Function

• We call h(x) a hash function.

• Any function that maps the input type to something suitable for indexing may be used.

• In Java, this means we are mapping from Object to int.

• In fact, every Java class has a built-in method called: int hashCode()

• This function is defined in the Object class, which means every object has a default one.

• It also means you can override it in your own objects.

• A hash function must be deterministic: it must always return the same value for the same input.

• Good hash functions distribute their output as uniformly as possible to minimize the number of “collisions”: two different input values that hash to the same output.

• If every distinct input value is mapped to a distinct output value, the function is called injective, or one-to-one. This is the ideal.

• If the space of possible inputs is larger than the space of possible outputs, an injective function is impossible (due to the pigeonhole principle: if you put n+1 objects in n holes, at least one hole must have more than one object in it).

• Because the hash function is computed on every access of the hash table, good hash functions execute very quickly.
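As a minimal sketch of the points above, the following overrides hashCode() in a class and turns the result into an array index. The `Employee` class, its fields, and the `indexFor` helper are hypothetical names introduced only for illustration; `Math.floorMod` is used because hashCode() may return a negative number.

```java
import java.util.Objects;

// Hypothetical Employee class, used only for illustration.
final class Employee {
    private final String name;
    private final int id;

    Employee(String name, int id) {
        this.name = name;
        this.id = id;
    }

    // Equal objects must produce equal hash codes, so equals() and
    // hashCode() are overridden together and use the same fields.
    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Employee)) return false;
        Employee e = (Employee) other;
        return id == e.id && name.equals(e.name);
    }

    // Deterministic: the same field values always give the same result.
    @Override
    public int hashCode() {
        return Objects.hash(name, id);
    }
}

final class HashIndexDemo {
    // Maps any hash code (possibly negative) to a slot in [0, tableSize).
    static int indexFor(Object key, int tableSize) {
        return Math.floorMod(key.hashCode(), tableSize);
    }

    public static void main(String[] args) {
        Employee a = new Employee("Alice", 1);
        Employee b = new Employee("Alice", 1);
        System.out.println(a.hashCode() == b.hashCode()); // equal objects, equal hashes
        System.out.println(indexFor("John", 8));          // some slot between 0 and 7
    }
}
```

Two different keys can still land on the same index here; nothing in this sketch handles that case.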

• If the range of possible inputs is larger than the range of possible outputs, it is impossible to obtain an ideal hash function due to the pigeonhole principle (know this principle).

• However, even if this is not the case, it is still unlikely that a uniform hash function will avoid collisions.

• This is due to the birthday paradox:

• This just refers to the counterintuitive notion that it is highly likely that two people in a relatively small group share the same birthday.

• Assuming a uniform distribution:

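• In a group of 23 people, the probability that at least two share a birthday is about 50%.

• In a group of 50 people, the probability is about 97%.

• The probability does not reach 100% until 366 people are in the room (ignoring leap years), since 365 people could still all have different birthdays.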

• “Having the same birthday” -> “Hashing to the same value”.

(Wikipedia)
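The birthday figures can be checked with a short calculation. This is a sketch under the same uniform-distribution assumption, with 365 possible birthdays and leap years ignored; the class and method names are made up for illustration.

```java
final class BirthdayParadox {
    // Probability that at least two of `people` share a birthday,
    // assuming 365 equally likely birthdays (leap years ignored).
    static double collisionProbability(int people) {
        if (people > 365) {
            return 1.0; // pigeonhole principle: more people than days
        }
        double allDistinct = 1.0;
        for (int i = 0; i < people; i++) {
            // The (i+1)-th person must avoid the i birthdays already taken.
            allDistinct *= (365.0 - i) / 365.0;
        }
        return 1.0 - allDistinct;
    }

    public static void main(String[] args) {
        System.out.printf("23 people: %.3f%n", collisionProbability(23));
        System.out.printf("50 people: %.3f%n", collisionProbability(50));
        System.out.printf("366 people: %.3f%n", collisionProbability(366));
    }
}
```

Replacing 365 with the number of slots in a hash table gives the chance that some pair of keys collides under a uniform hash function.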

• MD5

• MD4

• SHA1

• SHA2

• SHA3

• CRC32

• 3DES (strictly a block cipher rather than a hash function)

• Tiger

• (Aside: Many hash functions are used for cryptography as well. Should you use them to store passwords, make sure you combine each password with an extra random string, called a salt, before hashing, to defeat “rainbow table” attacks.)

• The hash table is the array that the hash function provides an index into.

• Like other arrays, it begins with a fixed capacity and strategies must be employed to maintain it as the hash table grows.

• Because performance degrades as the hash table begins to fill, the size of a hash table is usually increased when the fraction of occupied slots passes a certain threshold, called the load factor.

• For example, a table with a load factor of 0.75 would increase in size when it is 75% full.

• 0.75 is the default in Java’s Hashtable, HashSet, and HashMap classes.

• Collisions, mappings of distinct objects to the same position in the array, must also be handled.

• They become more of a problem as the hash table fills.
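The load-factor rule can be illustrated with a small sketch. The `resizeThreshold` helper is a hypothetical name that simply multiplies capacity by load factor; the two-argument `HashMap(int initialCapacity, float loadFactor)` constructor is part of the standard library and lets you set both values explicitly.

```java
import java.util.HashMap;
import java.util.Map;

final class LoadFactorDemo {
    // Number of entries a table with `capacity` slots can hold
    // before the load factor says it should grow.
    static int resizeThreshold(int capacity, float loadFactor) {
        return (int) (capacity * loadFactor);
    }

    public static void main(String[] args) {
        // 16 slots at a load factor of 0.75 gives a threshold of 12 entries.
        System.out.println(resizeThreshold(16, 0.75f));

        // The same two settings passed explicitly to a HashMap.
        Map<String, Double> salaries = new HashMap<>(16, 0.75f);
        salaries.put("Alice", 50000.0);
        salaries.put("Bob", 25000.0);
        salaries.put("John", 75000.0);
        System.out.println(salaries.size());
    }
}
```

A lower load factor trades memory for fewer collisions; a higher one does the opposite.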

• What if element B hashes to a location already filled by element A?

• We have a collision.

• There are two strategies for handling this scenario:

• Linear Probing.

• Chaining.

• Or, to put it in intuitive terms:

• This spot’s taken. Store the new element somewhere else.

• Cram both elements into the same spot.

• Let element B hash to the location h(B).

• Suppose h(B) is already filled by element A.

• A linear probing strategy simply stores B in the next available space.

• If h(B) + 1 is available, this is where it is stored.

• If not, we move to h(B) + 2 and check whether it is available.

• And so on.

• If we hit the end of the table, we wrap around to the beginning (modular arithmetic).

• It is also possible to use an arbitrary offset k.

• Then we check h(B) + k, h(B) + 2k, etc.

• Again, everything is taken mod n, where n is the size of the table, so we wrap.

• The same strategy is used for access:

• If the hashed element is not the same as the one we’re looking up, move down the hash table and check the next element. Repeat until the elements match or an empty space is reached.

Insert “Mallory” (linear probing):

• Suppose Mallory hashes to John’s spot.

• We check the next spot. It’s filled.

• We check the next spot. It’s filled.

• When we find an empty spot, Mallory is stored there.
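The insert and lookup procedure for linear probing can be sketched as follows. This is a simplified illustration, not a production table: the class name is hypothetical, the capacity is fixed (no resizing at a load factor), the step is always 1, and deletion is left out because it needs extra bookkeeping.

```java
final class LinearProbingTable {
    private final String[] keys;   // null marks an empty slot
    private final double[] values;
    private int size;

    LinearProbingTable(int capacity) {
        keys = new String[capacity];
        values = new double[capacity];
    }

    // The slot the hash function picks before any probing.
    private int home(String key) {
        return Math.floorMod(key.hashCode(), keys.length);
    }

    void put(String key, double value) {
        int i = home(key);
        for (int probes = 0; probes < keys.length; probes++) {
            if (keys[i] == null) {          // empty spot: store here
                keys[i] = key;
                values[i] = value;
                size++;
                return;
            }
            if (keys[i].equals(key)) {      // key already present: overwrite
                values[i] = value;
                return;
            }
            i = (i + 1) % keys.length;      // spot taken: try the next, wrapping
        }
        throw new IllegalStateException("table is full");
    }

    // Returns the stored value, or null if the key is absent.
    Double get(String key) {
        int i = home(key);
        for (int probes = 0; probes < keys.length; probes++) {
            if (keys[i] == null) return null;        // empty slot ends the search
            if (keys[i].equals(key)) return values[i];
            i = (i + 1) % keys.length;
        }
        return null;                                  // scanned the whole table
    }

    int size() {
        return size;
    }

    public static void main(String[] args) {
        LinearProbingTable salaries = new LinearProbingTable(8);
        salaries.put("Alice", 50000);
        salaries.put("Bob", 25000);
        salaries.put("John", 75000);
        System.out.println(salaries.get("John"));
        System.out.println(salaries.get("Mallory")); // never inserted
    }
}
```

Lookup follows exactly the same probe sequence as insertion, which is why an empty slot is enough to conclude that a key is absent.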

• Very space-efficient; values are stored in the hash table itself.

• Simple; no extra structures needed.

• Works fairly well when load factor is low.

• However, a low load factor wastes space.

• Because colliding elements remain adjacent in memory, caching behavior is exceptional.

• Collisions may cluster, and this requires traversing the hash table one element at a time to find the next available space. This may slow insertion.

• Let element B hash to the location h(B).

• Suppose h(B) is already filled by element A.

• A chaining strategy stores a linked list at each slot of the table and appends the new element to the list at h(B).

• When we wish to access the element again, we perform a linear search on the list.

Insert “Mallory” (chaining):

• Suppose Mallory hashes to John’s spot.

• We then append Mallory to a linked list in that same spot.
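Chaining can be sketched in a few lines. The class and method names below are hypothetical, the number of slots is fixed, and `put` scans the chain first so that an existing key is updated rather than duplicated.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

final class ChainedTable {
    private static final class Entry {
        final String key;
        double value;

        Entry(String key, double value) {
            this.key = key;
            this.value = value;
        }
    }

    // One linked list (chain) per slot.
    private final List<LinkedList<Entry>> buckets = new ArrayList<>();

    ChainedTable(int slots) {
        for (int i = 0; i < slots; i++) {
            buckets.add(new LinkedList<Entry>());
        }
    }

    // The chain is always the one the hash function points at.
    private LinkedList<Entry> chainFor(String key) {
        return buckets.get(Math.floorMod(key.hashCode(), buckets.size()));
    }

    void put(String key, double value) {
        LinkedList<Entry> chain = chainFor(key);
        for (Entry e : chain) {
            if (e.key.equals(key)) {   // key already present: overwrite
                e.value = value;
                return;
            }
        }
        chain.addLast(new Entry(key, value)); // collision or empty slot: append
    }

    // Linear search of the chain; returns null if the key is absent.
    Double get(String key) {
        for (Entry e : chainFor(key)) {
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }

    // Number of keys sharing this key's slot.
    int chainLength(String key) {
        return chainFor(key).size();
    }

    public static void main(String[] args) {
        ChainedTable salaries = new ChainedTable(2); // deliberately tiny
        salaries.put("Alice", 50000);
        salaries.put("Bob", 25000);
        salaries.put("John", 75000);
        System.out.println(salaries.get("John"));
        System.out.println(salaries.chainLength("John"));
    }
}
```

With only two slots and three keys, at least one chain must hold more than one entry, which is the pigeonhole principle at work.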

• Intuitive; the location we hash at is always the one returned by the hash function.

• New elements can be added to the list in constant time; linear probing may require a linear scan.

• Performance degrades only linearly as the table fills.

• More elements may be stored in the table than there are available slots using this method.

• You can quickly discover how many keys collide at a given slot by checking the length of its list.

• Storing the data in adjacent memory locations, as in linear probing, has very good caching behavior. Linked lists in general do not.

(Wikipedia)

• If all n keys are known prior to hashing, it is possible to construct a function that maps these keys to a hash table of size n without collisions.

• This function is known as a perfect hash function.

• There is a generalized procedure for discovering perfect hash functions described at http://cmph.sourceforge.net/papers/chm92.pdf.

• But since this is a difficult paper to understand, just be aware that it is possible.
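To make the idea concrete without the machinery of that paper, here is a naive brute-force sketch. It is not the procedure the paper describes: it simply tries different multipliers for a polynomial string hash until one maps every known key to a distinct slot. The class name, the hash family, and the search bound are all assumptions chosen for illustration, and the keys are assumed to be distinct.

```java
import java.util.HashSet;
import java.util.Set;

final class PerfectHashSearch {
    // Polynomial string hash; the multiplier is the tunable parameter.
    static int hash(String key, int multiplier, int tableSize) {
        int h = 0;
        for (int i = 0; i < key.length(); i++) {
            h = h * multiplier + key.charAt(i);
        }
        return Math.floorMod(h, tableSize);
    }

    // Tries multipliers 1..maxMultiplier and returns the first one that maps
    // every key to a distinct slot in a table of size keys.length,
    // or -1 if none in that range works.
    static int findMultiplier(String[] keys, int maxMultiplier) {
        for (int m = 1; m <= maxMultiplier; m++) {
            Set<Integer> used = new HashSet<>();
            boolean collisionFree = true;
            for (String key : keys) {
                if (!used.add(hash(key, m, keys.length))) {
                    collisionFree = false; // two keys share a slot: reject m
                    break;
                }
            }
            if (collisionFree) return m;
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] keys = {"Alice", "Bob", "John"};
        int m = findMultiplier(keys, 1000);
        for (String key : keys) {
            System.out.println(key + " -> " + hash(key, m, keys.length));
        }
    }
}
```

Trial and error like this is only practical for small key sets, since the chance that a random function is collision-free shrinks rapidly as n grows.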

• Cuckoo hashing is a strategy that uses two hash functions to insert.

• If a collision occurs using the first hash function, the existing element is pushed out of its space (replaced by the new element) and hashed using the second function.

• This can potentially push another element out. If a loop occurs, the hash table is rebuilt using a different set of hash functions.

• However, a collision on both hash functions is unlikely until the table begins to fill.

• This begins earlier than in the other two strategies:

• Using two hash functions, an appropriate load factor is .5.

• However, using three, the appropriate load factor jumps to .91.

• This strategy was generally found superior to both chaining and probing. However, it is still not widely known.

• Fortunately for you, I have some very esoteric areas of interest.
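The eviction process can be sketched for integer keys. Several choices here are assumptions made for illustration rather than details from the lecture: each hash function gets its own array, the two functions are k mod n and (k / n) mod n, and a suspected loop is reported with an exception instead of rebuilding the table with new hash functions.

```java
final class CuckooIntSet {
    private static final int MAX_KICKS = 32; // evictions allowed before giving up

    private final Integer[] first;   // slots addressed by h1; null means empty
    private final Integer[] second;  // slots addressed by h2; null means empty
    private final int capacity;

    CuckooIntSet(int capacity) {
        this.capacity = capacity;
        first = new Integer[capacity];
        second = new Integer[capacity];
    }

    private int h1(int key) {
        return Math.floorMod(key, capacity);
    }

    private int h2(int key) {
        return Math.floorMod(key / capacity, capacity);
    }

    // A key can only ever live in one of two places, so lookup is two checks.
    boolean contains(int key) {
        Integer a = first[h1(key)];
        if (a != null && a == key) return true;
        Integer b = second[h2(key)];
        return b != null && b == key;
    }

    void insert(int key) {
        if (contains(key)) return;
        int current = key;
        for (int kick = 0; kick < MAX_KICKS; kick++) {
            int i = h1(current);
            if (first[i] == null) {
                first[i] = current;
                return;
            }
            int evicted = first[i];  // push the occupant out of its space
            first[i] = current;
            current = evicted;

            int j = h2(current);     // rehash the evicted key with the second function
            if (second[j] == null) {
                second[j] = current;
                return;
            }
            evicted = second[j];     // this may push out yet another key
            second[j] = current;
            current = evicted;
        }
        // Simplification: the key held in `current` is dropped here. A full
        // implementation would rebuild the table with new hash functions.
        throw new IllegalStateException("probable eviction loop; rebuild needed");
    }

    public static void main(String[] args) {
        CuckooIntSet set = new CuckooIntSet(11);
        for (int key : new int[] {20, 50, 53, 75}) {
            set.insert(key);
        }
        System.out.println(set.contains(53));
        System.out.println(set.contains(7));
    }
}
```

In this example 20, 53, and 75 all share the same first-choice slot, so each later arrival evicts the earlier one into the second array.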

• Java has excellent built-in support for hashing.

• In particular, the unsorted associative containers utilize hash tables:

• HashMap, which you have used:

• Similar functions to TreeMap.

• Usually faster for random-access queries.

• As you saw in Assignment 3, performing range queries or sequential access is a pain (you had to sort).

• HashSet.

• Hashtable (which is very much like HashMap).

• Why are they unsorted?

• The point of a hash function is to turn keys into integers. In general, sorted order cannot be maintained through this conversion.
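A short sketch of the trade-off, using the salary figures from earlier in the lecture. The class and helper names are hypothetical; `HashMap`, `TreeMap`, `firstKey`, and `headMap` are standard library calls.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class MapComparison {
    // Builds the example map used earlier in the lecture.
    static Map<String, Double> sampleSalaries() {
        Map<String, Double> salaries = new HashMap<>();
        salaries.put("Alice", 50000.0);
        salaries.put("Bob", 25000.0);
        salaries.put("John", 75000.0);
        return salaries;
    }

    public static void main(String[] args) {
        Map<String, Double> salaries = sampleSalaries();

        // Random access by key: the HashMap's strength.
        System.out.println(salaries.get("John"));

        // HashMap iteration order is unspecified, so sorted or range
        // access means copying into a sorted structure first.
        TreeMap<String, Double> sorted = new TreeMap<>(salaries);
        System.out.println(sorted.firstKey());          // smallest key
        System.out.println(sorted.headMap("C").keySet()); // keys before "C"
    }
}
```

If range queries are the common case, starting with a TreeMap avoids the extra copy at the cost of O(log n) lookups.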

• Java: HashMap

• C++: hash_map (a non-standard extension; standardized as std::unordered_map in C++11)

• C#: Hashtable

• Perl: $var{'key'} = "value"

• PHP: $var['key'] = "value"

• Ruby: v = { 'key' => 'value' }

• What is the complexity of insertion in a hash table if there are no collisions?

• What if there are collisions?

• If you choose your table size appropriately, collisions are rather rare. The average size of your chains usually ends up around 2 or 3.

• Do hash tables need to use any extra space?

• Insertion (average): O(1).

• Access (average): O(1).

• Deletion (average): O(1).

• Insertion (worst): O(n).

• Access (worst): O(n).

• Deletion (worst): O(n).

• Since collisions are not very common with a good hash function and an appropriate load factor, hash tables very often yield constant-time insertion, access, and deletion.

• The amount of space used depends on the load factor, but remains O(n).

• They are incredibly useful structures!

• They allow you to index data by a generalized key rather than a numeric ID, and are therefore used extensively in databases and distributed queries. MapReduce, the key-value programming model Google uses for large-scale data processing, typically relies on hashing to partition keys among machines.

• This was our discussion of hashing.

• Next time, we will discuss amortized analysis and Java’s “Set” classes.

• The lesson:

• An unlikely event actually has a very high probability given enough repetitions (birthday paradox).