
# Introduction to Algorithms - PowerPoint PPT Presentation

Introduction to Algorithms. Jiafen Liu. Sept. 2013. Today’s Tasks. Hashing Direct access tables Choosing good hash functions Division Method Multiplication Method Resolving collisions by chaining Resolving collisions by open addressing. Symbol-Table Problem.


## PowerPoint Slideshow about 'Introduction to Algorithms' - nola-joyner

Presentation Transcript

### Introduction to Algorithms

Jiafen Liu

Sept. 2013

Hashing

• Direct access tables

• Choosing good hash functions

• Division Method

• Multiplication Method

• Resolving collisions by chaining

• Resolving collisions by open addressing

• Hashing comes up in compilers, where it is known as the symbol-table problem.

• Suppose: Table S holding n records:

• Operations on S:

• INSERT(S, x)

• DELETE(S, x)

• SEARCH(S, k)

• Dynamic Set vs Static Set

• Suppose that the keys are drawn from the set U⊆{0, 1, …, m–1}, and keys are distinct.

• Direct-access table: set up an array T[0 . . m–1] with

$$T[k] = \begin{cases} x & \text{if } x \in S \text{ and } key[x] = k, \\ \text{NIL} & \text{otherwise.} \end{cases}$$

• In the worst case, the 3 operations take time of

• Θ(1)
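
The direct-access table above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the class name `DirectAccessTable` and its method names are assumptions, `None` stands in for NIL, and keys are assumed to be distinct integers in {0, …, m–1}.

```python
class DirectAccessTable:
    """Direct-access table for distinct integer keys in {0, ..., m-1}."""

    def __init__(self, m):
        # One slot per possible key; None plays the role of NIL.
        self.slots = [None] * m

    def insert(self, key, record):
        # The key itself is the array index, so this is a single write.
        self.slots[key] = record

    def delete(self, key):
        self.slots[key] = None

    def search(self, key):
        # Returns the stored record, or None if the key is absent.
        return self.slots[key]


table = DirectAccessTable(8)
table.insert(3, "record with key 3")
print(table.search(3))  # record with key 3
print(table.search(5))  # None
```

Each operation is one array access, which is where the Θ(1) worst case comes from. The cost is the array itself: it needs one slot for every possible key, whether or not that key is ever stored.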

• Limitations of direct-access table?

• The range of keys can be large: 64-bit numbers

• character strings (difficult to represent as array indices).

• Hashing: Try to keep the table small, while preserving the property of constant running time.

• Solution: Use a hash function h to map the keys of records in S into {0, 1, …, m–1}.

[Figure: a hash function h maps the keys k1, …, k5 into slots 0 . . m–1 of table T; k2 and k5 land in the same slot, h(k2) = h(k5).]

• When a record to be inserted maps to an already occupied slot in T, a collision occurs.

• The simplest way to resolve collisions?

• Link records in the same slot into a list.

[Figure: slot i of T points to a linked list holding 49, 86, and 52, because h(49) = h(86) = h(52) = i.]

• What’s the worst case of chaining?

• Every key hashes to the same slot. The table degenerates into a single linked list.

• Access Time in the worst case?

• Θ(n) if we assume the size of S is n.
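
A small sketch of chaining may help make the worst case concrete. The class name `ChainedHashTable` and the `hash_function` parameter are hypothetical, and Python lists stand in for the linked lists on the slides. The constant hash function used in the demo is deliberately bad so that every key lands in one chain.

```python
class ChainedHashTable:
    """Hash table that resolves collisions by chaining."""

    def __init__(self, m, hash_function=None):
        self.m = m
        # Default to the division method if no hash function is supplied.
        self.h = hash_function or (lambda k: k % m)
        # One chain (here a Python list) per slot.
        self.slots = [[] for _ in range(m)]

    def insert(self, key, value):
        chain = self.slots[self.h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)  # overwrite an existing key
                return
        chain.append((key, value))

    def search(self, key):
        # Walk the chain of the slot the key hashes to.
        for k, v in self.slots[self.h(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        i = self.h(key)
        self.slots[i] = [(k, v) for k, v in self.slots[i] if k != key]


# Worst case: a (deliberately bad) hash function sends every key to slot 0.
worst = ChainedHashTable(4, hash_function=lambda k: 0)
for key in (49, 86, 52):
    worst.insert(key, str(key))
print([len(chain) for chain in worst.slots])  # [3, 0, 0, 0]
```

With all n keys in one chain, a search has to walk up to n entries, which matches the Θ(n) worst case stated above.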

• In order to analyze the average case

• we should know all possible inputs and their probability.

• We don’t know exactly the distribution, so we always make assumptions.

• Here, we make the assumption of simple uniform hashing:

• Each key k in S is equally likely to be hashed to any slot in T, independent of other keys.

• Simple uniform hashing includes an independence assumption.

• Let n be the number of keys in the table, and let m be the number of slots.

• Under the simple uniform hashing assumption, what is the probability that two given keys hash to the same slot?

• 1/m.

• Define: load factor of T to be α= n/m, that means?

• The average number of keys per slot.

• The expected time for an unsuccessful search for a record with a given key is?

Θ(1 + α)

• If α= O(1), expected search time = Θ(1)

• How about a successful search?

• It has the same asymptotic bound.

• In Θ(1 + α), the 1 is the time to apply the hash function and access the slot, and the α is the time to search the list.

• The assumption of simple uniform hashing is hard to guarantee, but several common techniques tend to work well in practice.

• A good hash function should distribute the keys uniformly into all the slots.

• Regularity of the key distribution should not affect this uniformity.

• For example, all the keys are even numbers.

• The simplest way to distribute keys to m slots evenly?

• Assume all keys are integers, and define

h(k) = k mod m.

• Advantage: simple and usually practical.

• Caution:

• Be careful about choice of modulus m.

• It doesn't work well for every size m of table.

• Example: if we pick m with a small divisor d.

Deficiency of Division Method

• Deficiency: if we pick m with a small divisor d.

• Example: d=2, so that m is an even number.

• Suppose it happens that all keys are even.

• What happens to the hash table?

• We will never hash anything to an odd-numbered slot.


• Extreme deficiency: if $m = 2^r$, the only prime factor of m is the small divisor 2.

• If $k = 1011000111011010_2$ and $m = 2^6$, what does the hash value turn out to be?

• It is just the low-order 6 bits of k, $h(k) = 011010_2$. The hash value does not even depend on all the bits of k.

• Suppose all the low-order bits are the same and only the high-order bits differ: then every key hashes to the same slot.

• Heuristics for choosing modulus m:

• Choose m to be a prime

• Make m not close to a power of two or ten.
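
The two deficiencies above are easy to reproduce. The following sketch uses a hypothetical helper `division_hash` and arbitrarily chosen table sizes (10, 11, and 2^6) purely for illustration.

```python
def division_hash(k, m):
    """Division method: h(k) = k mod m."""
    return k % m


# All keys are even. With an even modulus, odd-numbered slots are never used.
even_keys = range(0, 200, 2)
slots_even_m = {division_hash(k, 10) for k in even_keys}
slots_prime_m = {division_hash(k, 11) for k in even_keys}
print(sorted(slots_even_m))  # [0, 2, 4, 6, 8]
print(len(slots_prime_m))    # 11

# With m = 2**6 the hash value is just the low-order 6 bits of k.
k = 0b1011000111011010
print(format(division_hash(k, 2**6), "06b"))  # 011010
```

With the prime modulus 11 the same even keys reach all 11 slots, which is the motivation for the heuristic of choosing m prime.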

• The division method is not always a good choice:

• Sometimes, making the table size a prime is inconvenient. We often want to create a table of size $2^r$.

• The other reason is that division takes more time to compute than multiplication or addition on most computers.

• Multiplication method is a little more complicated but superior.

• Assume that all keys are integers, $m = 2^r$, and our computer has w-bit words.

• Define $h(k) = (A \cdot k \bmod 2^w) \text{ rsh } (w - r)$:

• A is an odd integer in the range $2^{w-1} < A < 2^w$.

• (Both the highest bit and the lowest bit are 1)

• rsh is the “bitwise right-shift” operator.

• Multiplication modulo $2^w$ is fast compared to division, and the rsh operator is fast.

• Tips: Don’t pick A too close to $2^{w-1}$ or $2^w$.

• Suppose that $m = 8 = 2^3$, r = 3, and that our computer has w = 7-bit words:

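• We choose $A = 1011001_2$ and $k = 1101011_2$.

• The product is $A \cdot k = 1001010\,0110011_2$.

• The high-order 7 bits (1001010) are ignored by the mod, and the low-order 4 bits (0011) are ignored by the rsh.

• The 3 bits that remain are the hash value: $h(k) = 011_2 = 3$.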
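
The worked example can be checked with a short function. The name `multiplication_hash` and its parameter order are assumptions made for this sketch; the constants are the ones from the slide.

```python
def multiplication_hash(k, A, w, r):
    """Multiplication method: h(k) = (A*k mod 2^w) rsh (w - r)."""
    low_word = (A * k) % (1 << w)   # keep only the low-order w bits
    return low_word >> (w - r)      # keep the top r bits of that word


# Values from the example: w = 7, r = 3, A = 1011001, k = 1101011 (binary).
A = 0b1011001
k = 0b1101011
print(format(A * k, "014b"))             # 10010100110011
print(multiplication_hash(k, A, 7, 3))   # 3
```

On a real machine with w-bit words the `% (1 << w)` step costs nothing, because the multiplication already discards the overflow.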

• We’ve talked about resolving collisions by chaining. With chaining, we need an extra link field in each record.

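• With open addressing, no storage is used outside the table: insertion systematically probes the table until an empty slot is found.

• The hash function depends on both the key and the probe number:

$$h : U \times \{0, 1, \dots, m-1\} \to \{0, 1, \dots, m-1\}$$

(universe of keys × probe number → slot number)

• The probe sequence 〈h(k,0), h(k,1), …, h(k,m–1)〉 should be a permutation of {0, 1, …, m–1}.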

• The hash table may fill up.

• We must have the number of elements less than or equal to the table size.

• Deletion is difficult, why?

• Suppose we remove a key from the table, and later a search looks for a different key.

• The probe sequence of that search happens to pass through the slot of the key we deleted.

• The search finds an empty slot there and concludes that its key is not in the table, even though the key may lie further along the probe sequence.

• We should mark deleted slots as deleted instead of emptying them.

• Alternatively, we can record globally the largest number of probes that any insertion has needed.

• A search never needs to probe more than that number of slots.

• There are lots of ideas about forming a probe sequence effectively.

• The simplest one is ?

• linear probing.

• Linear probing: given an ordinary hash function h′(k), linear probing uses

h(k,i) = (h′(k) + i) mod m

• It suffers from primary clustering, where regions of the hash table fill up.

• Anything that hashes into such a region has to probe through all of its occupied slots.

• What is more, long runs of occupied slots build up, increasing the average search time.
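
Linear probing and the deletion marker can be combined in one small sketch. Everything here is illustrative: the class name `LinearProbingTable`, the sentinels `EMPTY` and `DELETED`, and the choice of k mod m as the underlying hash function are assumptions, and keys are assumed to be distinct integers that are never inserted twice.

```python
EMPTY = object()    # slot has never held a key
DELETED = object()  # slot held a key that was removed


class LinearProbingTable:
    """Open addressing with linear probing: h(k, i) = (k mod m + i) mod m."""

    def __init__(self, m):
        self.m = m
        self.slots = [EMPTY] * m

    def _probe(self, key):
        start = key % self.m
        for i in range(self.m):
            yield (start + i) % self.m

    def insert(self, key):
        # Assumes the key is not already in the table.
        for j in self._probe(key):
            if self.slots[j] is EMPTY or self.slots[j] is DELETED:
                self.slots[j] = key
                return j
        raise OverflowError("hash table is full")

    def search(self, key):
        for j in self._probe(key):
            if self.slots[j] is EMPTY:
                return None            # a truly empty slot ends the search
            if self.slots[j] is not DELETED and self.slots[j] == key:
                return j
        return None

    def delete(self, key):
        j = self.search(key)
        if j is not None:
            self.slots[j] = DELETED    # mark, do not empty


t = LinearProbingTable(8)
for key in (1, 9, 17):                 # all three hash to slot 1
    t.insert(key)
t.delete(9)
print(t.search(17))  # 3
```

Because slot 2 is marked rather than emptied, the search for 17 keeps probing past it. The run of slots 1 to 3 is also a small instance of primary clustering.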

• Double hashing: given two ordinary hash functions h1(k), h2(k), double hashing uses

h(k,i) = ( h1(k) +i⋅h2(k) ) mod m

• If h2(k) is relatively prime to m, double hashing generally produces excellent results.

• One way to ensure this is to make m a power of 2 and design h2(k) to produce only odd numbers.
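
The claim that an odd h2(k) with m a power of 2 yields a full permutation can be checked directly. The two hash functions below are hypothetical choices made only for this sketch; any h2 that always returns an odd number would do.

```python
def h1(k, m):
    return k % m


def h2(k, m):
    # Hypothetical second hash function; always odd, so it is
    # relatively prime to m whenever m is a power of 2.
    return 2 * (k % (m // 2)) + 1


def probe_sequence(k, m):
    """Double hashing: h(k, i) = (h1(k) + i * h2(k)) mod m."""
    return [(h1(k, m) + i * h2(k, m)) % m for i in range(m)]


m = 16
seq = probe_sequence(123, m)
print(seq[:4])                        # [11, 2, 9, 0]
print(sorted(seq) == list(range(m)))  # True
```

Since the step size h2(k) shares no factor with m, the sequence visits every slot exactly once before repeating.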

• We make the assumption of uniform hashing:

• Each key is equally likely to have any one of the m! permutations as its probe sequence, independent of other keys.

• Theorem. Given an open-addressed hash table with load factor α= n/m< 1, the expected number of probes in an unsuccessful search is at most 1/(1–α) .

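Proof:

• At least one probe is always necessary.

• With probability n/m, the first probe hits an occupied slot, and a second probe is necessary.

• With probability (n–1)/(m–1), the second probe hits an occupied slot, and a third probe is necessary.

• With probability (n–2)/(m–2), the third probe hits an occupied slot, etc.

• How do we complete the proof?

• Observe that $\frac{n-i}{m-i} < \frac{n}{m} = \alpha$ for i = 1, 2, …, n.

• Therefore, the expected number of probes is

$$1 + \frac{n}{m}\left(1 + \frac{n-1}{m-1}\left(1 + \frac{n-2}{m-2}\left(\cdots\left(1 + \frac{1}{m-n+1}\right)\cdots\right)\right)\right) \le 1 + \alpha\left(1 + \alpha\left(1 + \alpha\left(\cdots(1 + \alpha)\cdots\right)\right)\right) \le 1 + \alpha + \alpha^2 + \alpha^3 + \cdots = \sum_{i=0}^{\infty} \alpha^i = \frac{1}{1-\alpha}$$

(geometric series)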

• If α is constant, then accessing an open-addressed hash table takes constant time.

• If the table is half full, then the expected number of probes is ?

• 1/(1–0.5) = 2.

• If the table is 90% full, then the expected number of probes is ?

• 1/(1–0.9) = 10.
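
The nested expression from the proof can be evaluated exactly and compared with the bound 1/(1–α). This is a sketch under the uniform hashing assumption; the function name `expected_probes_unsuccessful` and the table size m = 100 are assumptions for illustration.

```python
from fractions import Fraction


def expected_probes_unsuccessful(n, m):
    """Evaluate 1 + n/m (1 + (n-1)/(m-1) (1 + ...)) exactly, for n < m."""
    expected = Fraction(1)
    # Work from the innermost term 1/(m-n+1) outwards to n/m.
    for i in range(n - 1, -1, -1):
        expected = 1 + Fraction(n - i, m - i) * expected
    return expected


m = 100
for n in (50, 90):
    alpha = Fraction(n, m)
    exact = expected_probes_unsuccessful(n, m)
    bound = 1 / (1 - alpha)
    print(n, float(exact), float(bound), exact <= bound)
```

For both load factors the exact value stays below the bound, and the gap is small, so 1/(1–α) is a good estimate and not just an upper limit.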

• Filling the table almost completely makes hashing slow.

• Universal hashing

• Perfect hashing

• Problem: For any hash function h, there exists a bad set of keys that all hash to the same slot.

• It causes the average access time of a hash table to skyrocket.

• An adversary can pick all keys from {k: h(k) = i } for some slot i.

• IDEA: Choose the hash function at random, independently of the keys.

• Theorem:

• Let h be a hash function chosen at random from a universal set H of hash functions.

• Suppose h is used to hash n arbitrary keys into the m slots of a table T.

• Then for a given key x, we have:

E[number of collisions with x] < n/m.

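• Proof. Let $C_x$ be the random variable denoting the total number of collisions of keys in T with x, and let

$$c_{xy} = \begin{cases} 1 & \text{if } h(x) = h(y), \\ 0 & \text{otherwise.} \end{cases}$$

• Because H is universal, $E[c_{xy}] = 1/m$, and $C_x = \sum_{y \in T - \{x\}} c_{xy}$, so

$$E[C_x] = \sum_{y \in T - \{x\}} E[c_{xy}] = \frac{n-1}{m} < \frac{n}{m}.$$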

• One method to construct a set of universal hash functions:

• Let m be prime. Decompose key k into r+1 digits, each with value in the set {0, 1, …, m–1}.

• That is, let k = <k0, k1, …, kr>, where 0≤ki<m.

• Randomized strategy:

• Pick a = 〈a0, a1, …, ar〉 where each ai is chosen randomly from {0, 1, …, m–1}.

• Define

$$h_a(k) = \left(\sum_{i=0}^{r} a_i k_i\right) \bmod m.$$

• How big is H = {ha}?

• $|H| = m^{r+1}$.

• Theorem. The set H = {ha} is universal.

• Proof.

• Let x = 〈x0, x1, …, xr〉 and y = 〈y0, y1, …, yr〉 be distinct keys.

• Thus, they differ in at least one digit position.

• Without loss of generality, position 0.

• For how many ha∈H do x and y collide?

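• ha(x) = ha(y), which implies that

$$\sum_{i=0}^{r} a_i x_i \equiv \sum_{i=0}^{r} a_i y_i \pmod{m}.$$

• Equivalently, we have

$$a_0 (x_0 - y_0) + \sum_{i=1}^{r} a_i (x_i - y_i) \equiv 0 \pmod{m}.$$

• We just have

$$a_0 (x_0 - y_0) \equiv -\sum_{i=1}^{r} a_i (x_i - y_i) \pmod{m},$$

and since x0 ≠ y0 and m is prime, an inverse $(x_0 - y_0)^{-1}$ modulo m must exist, which implies that

$$a_0 \equiv \left(-\sum_{i=1}^{r} a_i (x_i - y_i)\right) (x_0 - y_0)^{-1} \pmod{m}.$$

• Thus, for any choices of a1, a2, …, ar, exactly one choice of a0 causes x and y to collide.

• How many ha will cause x and y to collide?

• There are m choices for each of a1, a2, …, ar, but once these are chosen, exactly one choice for a0 causes x and y to collide.

• Thus, the number of ha that cause x and y to collide is $m^r \cdot 1 = m^r = |H|/m$.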
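
For small parameters the counting argument can be verified by brute force. The sketch below enumerates every vector a for an arbitrarily chosen prime m = 5 and keys of r + 1 = 3 digits; the function names and the two sample keys are assumptions for illustration.

```python
from itertools import product


def h_a(a, key_digits, m):
    """Dot-product hash: h_a(k) = (sum of a_i * k_i) mod m."""
    return sum(ai * ki for ai, ki in zip(a, key_digits)) % m


def count_colliding_functions(x, y, m):
    """Count the vectors a for which h_a(x) == h_a(y)."""
    digits = len(x)  # this is r + 1
    return sum(
        1
        for a in product(range(m), repeat=digits)
        if h_a(a, x, m) == h_a(a, y, m)
    )


m = 5                       # prime, as the construction requires
x = (1, 4, 2)               # two distinct keys of r + 1 = 3 digits
y = (3, 4, 0)
print(count_colliding_functions(x, y, m))  # 25
print(m ** len(x) // m)                    # 25, that is |H| / m
```

Out of the 125 functions in H, exactly 25 make x and y collide, so a randomly chosen function collides on this pair with probability 1/m.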

• Requirement: Given a set of n keys, construct a static hash table of size m = O(n) such that SEARCH takes Θ(1) time in the worst case.

• IDEA: Two-level scheme with universal hashing at both levels. No collisions at level 2!

• Theorem. Let H be a class of universal hash functions for a table of size $m = n^2$. If we use a random h∈H to hash n keys into the table, the expected number of collisions is at most 1/2.

• Proof. By the definition of universality, the probability that two given keys collide under h is $1/m = 1/n^2$. There are $\binom{n}{2}$ pairs of keys that can possibly collide, so the expected number of collisions is

$$\binom{n}{2} \cdot \frac{1}{n^2} = \frac{n(n-1)}{2} \cdot \frac{1}{n^2} < \frac{1}{2}.$$

• Markov’s inequality says that for any nonnegative random variable X and any t > 0, we have Pr{X ≥ t} ≤ E[X]/t.

• Theorem. The probability of no collisions is at least 1/2.

• Proof. Applying this inequality with t = 1, we find that the probability of 1 or more collisions is at most 1/2.

• Conclusion: Just by testing random hash functions in H, we’ll quickly find one that works.
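
The retry loop in the conclusion might look like the sketch below. Note the substitution: in place of the dot-product family, which needs a prime table size, it uses the family h(k) = ((a·k + b) mod p) mod m with a prime p larger than every key, another standard universal family. The function name, the choice p = 2^31 – 1, and the sample keys are all assumptions for illustration.

```python
import random


def find_collision_free_hash(keys, p=2_147_483_647, rng=None):
    """Draw random hash functions until one has no collisions on keys.

    Assumes distinct integer keys in [0, p); p = 2**31 - 1 is prime.
    The table size is m = n**2, as in the theorem above.
    """
    rng = rng or random.Random()
    m = len(keys) ** 2
    while True:
        a = rng.randrange(1, p)
        b = rng.randrange(0, p)
        slots = {((a * k + b) % p) % m for k in keys}
        if len(slots) == len(keys):   # every key got its own slot
            return a, b, m


keys = [10, 22, 37, 40, 52, 60, 70, 72, 75]
a, b, m = find_collision_free_hash(keys, rng=random.Random(0))
table = [None] * m
for k in keys:
    table[((a * k + b) % 2_147_483_647) % m] = k
print(m, sum(slot is not None for slot in table))  # 81 9
```

Each draw succeeds with probability at least 1/2, so the expected number of draws is at most 2. In the two-level scheme this search would be run once per first-level slot, on the keys that landed in that slot.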